Prefix preserving hash function - hash

I am looking for a hash function f() whose outputs can preserve the prefix of the inputs. The detailed requirements are as followings.
f() takes variable-length bit strings as input and outputs bit strings;
assume a and b are bit strings and a is a substring of b, then f(a) is also a substring of f(b);
the length of the output bit string should be smaller than the input bit string.
Any idea?

There will be no such hash function that meets your criterion.
Suppose you have such hash function Hash that preserves prefix, then answer these questions:
(1) Hash("a") =? It could be anything, right?
(2) What about Hash("xa")=? to preserve the prefix, it has to be
|Hash("xa")-Hash("a")| + Hash("a")
(3) What about Hash("yxa")=? similarly as (2), it has to be
|Hash("yxa")-Hash("xa")| + |Hash("xa")-Hash("a")| + Hash("a")
So the hash will always have longer lengh than the original.

Related

How are scalars stored 'under the hood' in perl?

The basic types in perl are different then most languages, with types being scalar, array, hash (but apparently not subroutines, &, which I guess are really just scalar references with syntactical sugar). What is most odd about this is that the most common data types: int, boolean, char, string, all fall under the basic data type "scalar". It seems that perl decides rather to treat a scalar as a string, boolean, or number based off of the operator that modifies it, implying the scalar itself is not actually defined as "int" or "String" when saved.
This makes me curious as to how these scalars are stored "under the hood", particularly in regards to it's effect on efficiency (yes I know scripting languages sacrifice efficiency for flexibility, but they still need to be as optimized as possible when flexibility concerns are not affected). It's much easier for me to store the number 65535 (which takes two bytes) then the string "65535" which takes 6 bytes, as such recognizing that $val = 65535 is storing an int would allow me to use 1/3 the memory, in large arrays this could mean fewer cache hits as well.
It's not just limited to saving memory of course. There are times when I can offer more significant optimizations if I know what type of scalar to expect. For instance if I have a hash using very large integers as keys it would be far faster to look up a value if I recognizing the keys as ints, allowing a simply modulo for creating my hash key, then if I have to run more complex hashing logic on a string that has 3 times the bytes.
So I'm wondering how perl handles these scalars under the hood. Does it store every value as a string, sacrificing the extra memory and cpu cost of constant converting string to int in the case that a scalar is always used as an int? Or does it have some logic for inference the type of scalar used to determine how to save and manipulate it?
Edit:
TJD linked to perlguts, which answers half my question. A scalar is actually stored as string, int (signed, unsigned, double) or pointer. I'm not too surprised, I had mostly expected this behavior to occur under the hood, though it's interesting to see the exact types. I'm leaving this question open though because perlguts is actually to low level. Other then telling me that 5 data types exist it doesn't specify how perl works to alternate between them, ie how perl decides which SV type to use when a scalar is saved and how it knows when/how to cast.
There are actually a number of types of scalars. A scalar of type SVt_IV can hold undef, a signed integer (IV) or an unsigned integer (UV). One of type SVt_PVIV can also hold a string[1]. Scalars are silently upgraded from one type to another as needed[2]. The TYPE field indicates the type of a scalar. In fact, arrays (SVt_AV) and hashes (SVt_HV) are really just types of scalars.
While the type of a scalar indicates what the scalar can contain, flags are used to indicate what a scalar does contain. This is stored in the FLAGS field. SVf_IOK signals that a scalar contains a signed integer, while SVf_POK indicates it contains a string[3].
Devel::Peek's Dump is a great tool for looking at the internals of scalars. (The constant prefixes SVt_ and SVf_ are omitted by Dump.)
$ perl -e'
use Devel::Peek qw( Dump );
my $x = 123;
Dump($x);
$x = "456";
Dump($x);
$x + 0;
Dump($x);
'
SV = IV(0x25f0d20) at 0x25f0d30 <-- SvTYPE(sv) == SVt_IV, so it can contain an IV.
REFCNT = 1
FLAGS = (IOK,pIOK) <-- IOK: Contains an IV.
IV = 123 <-- The contained signed integer (IV).
SV = PVIV(0x25f5ce0) at 0x25f0d30 <-- The SV has been upgraded to SVt_PVIV
REFCNT = 1 so it can also contain a string now.
FLAGS = (POK,IsCOW,pPOK) <-- POK: Contains a string (but no IV since !IOK).
IV = 123 <-- Meaningless without IOK.
PV = 0x25f9310 "456"\0 <-- The contained string.
CUR = 3 <-- Number of bytes used by PV (not incl \0).
LEN = 10 <-- Number of bytes allocated for PV.
COW_REFCNT = 1
SV = PVIV(0x25f5ce0) at 0x25f0d30
REFCNT = 1
FLAGS = (IOK,POK,IsCOW,pIOK,pPOK) <-- Now contains both a string (POK) and an IV (IOK).
IV = 456 <-- This will be used in numerical contexts.
PV = 0x25f9310 "456"\0 <-- This will be used in string contexts.
CUR = 3
LEN = 10
COW_REFCNT = 1
illguts documents the internal format of variables quite thoroughly, but perlguts might be a better place to start.
If you start writing XS code, keep in mind it's usually a bad idea to check what a scalar contains. Instead, you should request what should have been provided (e.g. using SvIV or SvPVutf8). Perl will automatically convert the value to the requested type (and warn if appropriate). API calls are documented in perlapi.
In fact, it can hold a string an either a signed integer or an unsigned integer at the same time.
All scalars (including arrays and hashes, excluding one type of scalar that can only hold undef) have two memory blocks at their base. Pointers to the scalar point to its head, which contains the TYPE field and a pointer to the body. Upgrading a scalar replaces the body of the scalar. That way, pointers to the scalar aren't invalidated by an upgrade.
An undef variable is one without any uppercase OK flags set.
The formats used by Perl for data storage are documented in the perlguts perldoc.
In short, though, a Perl scalar is stored as a SV structure containing one of a number of different types, such as an int, a double, a char *, or a pointer to another scalar. (These types are stored as a C union, so only one of them will be present at a time; the SV contains flags indicating which type is used.)
(With regard to hash keys, there's an important gotcha to note there: hash keys are always strings, and are always stored as strings. They're stored in a different type from other scalars.)
The Perl API includes a number of functions which can be used to access the value of a scalar as a desired C type. For example, SvIV() can be used to return the integer value of a SV: if the SV contains an int, that value is returned directly; if the SV contains another type, it's coerced to an integer as appropriate. These functions are used throughout the Perl interpreter for type conversions. However, there is no automatic inference of types on output; functions which operate on strings will always return a PV (string) scalar, for instance, regardless of whether the string "looks like" a number or not.
If you're curious what a given scalar looks like internally, you can use the Devel::Peek module to dump its contents.
Others have addressed the "how are scalars stored" part of your question, so I'll skip that. With regard to how Perl decides which representation of a value to use and when to convert between them, the answer is it depends on which operators are applied to the scalar. For example, given this code:
my $score = 0;
The scalar $score will be initialised with an integer value. But then when this line of code is run:
say "Your score is $score";
The double quote operator means that Perl will need a string representation of the value. So the conversion from integer to string will take place as part of the process of assembling the string argument to the say function. Interestingly, after the stringification of $score, the underlying representation of the scalar will now include both an integer and a string representation, allowing subsequent operations to directly grab the relevant value without having to convert again. If a numeric operator is then applied to the string (e.g.: $score++) then the numeric part will be updated and the (now invalid) string part will be discarded.
This is the reason why Perl operators tend to come in two flavours. For example comparing values of numbers is done with <, ==, > while performing the same comparisons with strings would be done with lt, eq, gt. Perl will coerce the value of the scalar(s) to the type which matches the operator. This is why the + operator does numeric addition in Perl but a separate operator . is needed to do string concatenation: + will coerce its arguments to numeric values and . will coerce to strings.
There are some operators that will work with both numeric and string values but which perform a different operation depending on the type of value. For example:
$score = 0;
say ++$score; # 1
say ++$score; # 2
say ++$score; # 3
$score = 'aaa';
say ++$score; # 'aaa'
say ++$score; # 'aab'
say ++$score; # 'aac'
With regard to questions of efficiency (and bearing in mind standard disclaimers about premature optimisation etc). Consider this code which reads a file containing one integer per line, each integer is validated to check it is exactly 8 digits long and the valid ones are stored in an array:
my #numbers;
while(<$fh>) {
if(/^(\d{8})$/) {
push #numbers, $1;
}
}
Any data read from a file will initially come to us as a string. The regex used to validate the data will also require a string value in $_. So the result is that our array #numbers will contain a list of strings. However, if further uses of the values will be solely in a numeric context, we could use this micro-optimisation to ensure that the array contained only numeric values:
push #numbers, 0 + $1;
In my tests with a file of 10,000 lines, populating #numbers with strings used nearly three times as much memory as populating with integer values. However as with most benchmarks, this has little relevance to normal day-to-day coding in Perl. You'd only need to worry about that in situations where you a) had performance or memory issues and b) were working with a large number of values.
It's worth pointing out that some of this behaviour is common to other dynamic languages (e.g.: Javascript will silently coerce numeric values to strings).

case and white space insensitive hashing function

I am looking for a hashing function that is case insensitive and ignores white spaces as well.
for example:
the hash value generated for this is a hash and ThisIsAHash will be exactly the same.
does any such hash function exist?
Hash Functions are how we make them. For example:
First, for all strings ->
Step1. Lowercase them (or Uppercase them)
Step2. Strip all Whitespaces.
By now, both strings would map to: thisisahash
Step3. Now, apply any Hash function to it: crc32, java's polynomial or whatever...
Given a string, you can always now do a lookup and see if other Strings are hashed to the same key.
Note that hash functions are one-way. So doing Step1 and Step2 don't count against valid hash methods.

How does this Perl one-liner actually work?

So, I happened to notice that last.fm is hiring in my area, and since I've known a few people who worked there, I though of applying.
But I thought I'd better take a look at the current staff first.
Everyone on that page has a cute/clever/dumb strapline, like "Is life not a thousand times too short for us to bore ourselves?". In fact, it was quite amusing, until I got to this:
perl -e'print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34'
Which I couldn't resist pasting into my terminal (kind of a stupid thing to do, maybe), but it printed:
Just another Last.fm hacker,
I thought it would be relatively easy to figure out how that Perl one-liner works. But I couldn't really make sense of the documentation, and I don't know Perl, so I wasn't even sure I was reading the relevant documentation.
So I tried modifying the numbers, which got me nowhere. So I decided it was genuinely interesting and worth figuring out.
So, 'how does it work' being a bit vague, my question is mainly,
What are those numbers? Why are there negative numbers and positive numbers, and does the negativity or positivity matter?
What does the combination of operators +=$_ do?
What's pack+q,c*,, doing?
This is a variant on “Just another Perl hacker”, a Perl meme. As JAPHs go, this one is relatively tame.
The first thing you need to do is figure out how to parse the perl program. It lacks parentheses around function calls and uses the + and quote-like operators in interesting ways. The original program is this:
print+pack+q,c*,,map$.+=$_,74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34
pack is a function, whereas print and map are list operators. Either way, a function or non-nullary operator name immediately followed by a plus sign can't be using + as a binary operator, so both + signs at the beginning are unary operators. This oddity is described in the manual.
If we add parentheses, use the block syntax for map, and add a bit of whitespace, we get:
print(+pack(+q,c*,,
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
The next tricky bit is that q here is the q quote-like operator. It's more commonly written with single quotes:
print(+pack(+'c*',
map{$.+=$_} (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21,
18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34)))
Remember that the unary plus is a no-op (apart from forcing a scalar context), so things should now be looking more familiar. This is a call to the pack function, with a format of c*, meaning “any number of characters, specified by their number in the current character set”. An alternate way to write this is
print(join("", map {chr($.+=$_)} (74, …, -34)))
The map function applies the supplied block to the elements of the argument list in order. For each element, $_ is set to the element value, and the result of the map call is the list of values returned by executing the block on the successive elements. A longer way to write this program would be
#list_accumulator = ();
for $n in (74, …, -34) {
$. += $n;
push #list_accumulator, chr($.)
}
print(join("", #list_accumulator))
The $. variable contains a running total of the numbers. The numbers are chosen so that the running total is the ASCII codes of the characters the author wants to print: 74=J, 74+43=117=u, 74+43-2=115=s, etc. They are negative or positive depending on whether each character is before or after the previous one in ASCII order.
For your next task, explain this JAPH (produced by EyesDrop).
''=~('(?{'.('-)#.)#_*([]#!#/)(#)#-#),#(##+#)'
^'][)#]`}`]()`#.#]#%[`}%[#`#!##%[').',"})')
Don't use any of this in production code.
The basic idea behind this is quite simple. You have an array containing the ASCII values of the characters. To make things a little bit more complicated you don't use absolute values, but relative ones except for the first one. So the idea is to add the specific value to the previous one, for example:
74 -> J
74 + 43 -> u
74 + 42 + (-2 ) -> s
Even though $. is a special variable in Perl it does not mean anything special in this case. It is just used to save the previous value and add the current element:
map($.+=$_, ARRAY)
Basically it means add the current list element ($_) to the variable $.. This will return a new array with the correct ASCII values for the new sentence.
The q function in Perl is used for single quoted, literal strings. E.g. you can use something like
q/Literal $1 String/
q!Another literal String!
q,Third literal string,
This means that pack+q,c*,, is basically pack 'c*', ARRAY. The c* modifier in pack interprets the value as characters. For example, it will use the value and interpret it as a character.
It basically boils down to this:
#!/usr/bin/perl
use strict;
use warnings;
my $prev_value = 0;
my #relative = (74,43,-2,1,-84, 65,13,1,5,-12,-3, 13,-82,44,21, 18,1,-70,56, 7,-77,72,-7,2, 8,-6,13,-70,-34);
my #absolute = map($prev_value += $_, #relative);
print pack("c*", #absolute);

Perl autoincrement of string not working as before

I have some code where I am converting some data elements in a flat file. I save the old:new values to a hash which is written to a file at the end of processing. On subsequence execution, I reload into a hash so I can reuse previously converted values on additional data files. I also save the last conversion value so if I encounter an unconverted value, I can assign it a new converted value and add it to the hash.
I had used this code before (back in Feb) on six files with no issues. I have a variable that is set to ZCKL0 (last character is a zero) which is retrieved from a file holding the last used value. I apply the increment operator
...
$data{$olddata} = ++$dataseed;
...
and the resultant value in $dataseed is 1 instead of ZCKL1. The original starting seed value was ZAAA0.
What am I missing here?
Do you use the $dataseed variable in a numeric context in your code?
From perlop:
If you increment a variable that is
numeric, or that has ever been used in
a numeric context, you get a normal
increment. If, however, the variable
has been used in only string contexts
since it was set, and has a value that
is not the empty string and matches
the pattern /^[a-zA-Z][0-9]\z/ , the
increment is done as a string,
preserving each character within its
range.
As prevously mentioned, ++ on strings is "magic" in that it operates differently based on the content of the string and the context in which the string is used.
To illustrate the problem and assuming:
my $s='ZCL0';
then
print ++$s;
will print:
ZCL1
while
$s+=0; print ++$s;
prints
1
NB: In other popular programming languages, the ++ is legal for numeric values only.
Using non-intuitive, "magic" features of Perl is discouraged as they lead to confusing and possibly unsupportable code.
You can write this almost as succinctly without relying on the magic ++ behavior:
s/(\d+)$/ $1 + 1 /e
The e flag makes it an expression substitution.

hash collision and appending data

Assume I have two strings (or byte arrays) A and B which both have the same hash (with hash I mean things like MD5 or SHA1). If I concatenate another string behind it, will A+C and B+C have the same hash H' as well? What happens to C+A and C+B?
I tested it with MD5 and in all my tests, appending something to the end made the hash the same, but appending at the beginning did not.
Is this always true (for all inputs)?
Is this true for all (well-known) hash functions? If no, is there a (well-known) hash function, where A+C and B+C will not collide (and C+A and C+B do not either)?
(besides from MD5(x + reverse(x)) and other constructed stuff I mean)
Details depend on the hash function H, but generally they work as follows:
Consume a block of input X (say, 512 bits)
Break the input into smaller pieces (say, 32 bits) and update hash internal state based on the input
If there's more input, go to step 1
At the end, spit the internal state out as the hash value H(X)
So, if A and B collide i.e. H(A) = H(B), the hash will be in the same state after consuming them. Updating the state further with the same input C can make the resulting hash value identical. This explains why H(A+C) is sometimes H(B+C). But it depends how A's and B's sizes are aligned to input block size and how the hash breaks the input block internally.
C+A and C+B can be identical if C is a multiple of the hash block size but probably not otherwise.
This depends entirely on the hash function. Also, the probability that you have those collisions is really small.
The hash functions being discussed here are typically cryptographic (SHA1, MD5). These hash functions have an Avalanche effect -- the output will change drastically with a slight change in the input.
The prefix and suffix extension of C will effectively make a longer input.
So, adding anything to the front or rear of the input should change the effective hash outputs significantly.
I do not understand how you did the MD5 check, here is my test.
echo "abcd" | md5sum
70fbc1fdada604e61e8d72205089b5eb
echo "0abcd" | md5sum
f5ac8127b3b6b85cdc13f237c6005d80
echo "abcd0" | md5sum
4c8a24d096de5d26c77677860a3c50e3
Are you saying that you located two inputs which had the same MD5 hash and then appended something to the end or beginning of the input and found that adding at the end resulted in the same MD5 as that for the original input?
Please provide samples with your test results.