case and white space insensitive hashing function - hash

I am looking for a hashing function that is case insensitive and ignores white spaces as well.
for example:
the hash value generated for this is a hash and ThisIsAHash will be exactly the same.
does any such hash function exist?

Hash Functions are how we make them. For example:
First, for all strings ->
Step1. Lowercase them (or Uppercase them)
Step2. Strip all Whitespaces.
By now, both strings would map to: thisisahash
Step3. Now, apply any Hash function to it: crc32, java's polynomial or whatever...
Given a string, you can always now do a lookup and see if other Strings are hashed to the same key.
Note that hash functions are one-way. So doing Step1 and Step2 don't count against valid hash methods.

Related

What does the % mean in Perl?

It's been many years since I have been in Perl and I'm finding myself needing to get back up to speed. I have a a line of code that I do not know what it means.
%form = %$input;
Can someone explain or direct me to where I can find find the answer to this? I am not understanding the % symbol.
The short answer is that line copies the contents of a reference to a hash to a named hash. The programmer is likely uncomfortable with reference syntax. No big whoop.
The key concept that people miss about Perl 5's sigils is that they show how you are treating a variable, not what type it is. $ is single items, # is multiple items. % is hash stuff.
The % signifies a hash variable as a whole. So, %form is the entire hash named "form". But, to get a single element out of it, you use the $ (single element) sigil. When you see the {} after a variable, you know you are dealing with a hash:
%form # entire hash named "form"
$form{foo} # single value for key "foo" in hash form
#form{qw(foo bar)} # multiple value for keys "foo" and "bar" (slice)
The second one is more tricky (it's the stuff we cover in Intermediate Perl. $input is a reference to a hash. All references are scalars (so, the $ sigil). To use it as a hash, you have to deference it. For a simple scalar like that, you can put the hash sigil in front: %$input. Now you can treat that as a hash and use the hash operators (keys, values, delete) on it.
Starting with v5.26, there's also a postfix dereference so you can read your left to right: $input->%*.
%$input # entire hash referenced by $input
$input->%* # entire hash, with new hotness postfix deref
${$input}{foo} # single element access: extra $ in front, braces around ref
$$input{foo} # same thing
$input->{foo} # single element access with arrow and braces
#{$input}{qw(foo bar)} # hash slice, multiple items get `#`
#$input{qw(foo bar)} # same thing
$input->#{qw(foo bar)} # same thing, but with postfix notation
Now there's an even more tricky thing. v5.20 introduces the key-value slice, so the % gets some more work to do. This is a slice that returns the keys along with the values, so it gets the % for hash like things:
%form{qw(key1 key2)} # returns a list of key-value pairs
But, this also works on arrays to get the index and value. You know it's an array because you see the [], but you know it's returning index-value pairs because you see the %:
%array[1,3,7] # returns list like ( 1, ..., 3, ..., 7, ...)

Prefix preserving hash function

I am looking for a hash function f() whose outputs can preserve the prefix of the inputs. The detailed requirements are as followings.
f() takes variable-length bit strings as input and outputs bit strings;
assume a and b are bit strings and a is a substring of b, then f(a) is also a substring of f(b);
the length of the output bit string should be smaller than the input bit string.
Any idea?
There will be no such hash function that meets your criterion.
Suppose you have such hash function Hash that preserves prefix, then answer these questions:
(1) Hash("a") =? It could be anything, right?
(2) What about Hash("xa")=? to preserve the prefix, it has to be
|Hash("xa")-Hash("a")| + Hash("a")
(3) What about Hash("yxa")=? similarly as (2), it has to be
|Hash("yxa")-Hash("xa")| + |Hash("xa")-Hash("a")| + Hash("a")
So the hash will always have longer lengh than the original.

What do these lines in `dna2protein.pl` do?

I'm a newbie to perl and I found a script to convert a DNA sequence to protein sequence using Perl. I don't understand what some lines in that script do, specially the following:
my(%g)=('TCA'=>'S','TCC'=>'S','TCG'=>'S','TCT'=>'S','TTC'=>'F','TTT'=>'F','TTA'=>'L','TTG'=>'L','TAC'=>'Y','TAT'=>'Y','TAA'=>'_','TAG'=>'_','TGC'=>'C','TGT'=>'C','TGA'=>'_','TGG'=>'W','CTA'=>'L','CTC'=>'L','CTG'=>'L','CTT'=>'L','CCA'=>'P','CCC'=>'P','CCG'=>'P','CCT'=>'P','CAC'=>'H','CAT'=>'H','CAA'=>'Q','CAG'=>'Q','CGA'=>'R','CGC'=>'R','CGG'=>'R','CGT'=>'R','ATA'=>'I','ATC'=>'I','ATT'=>'I','ATG'=>'M','ACA'=>'T','ACC'=>'T','ACG'=>'T','ACT'=>'T','AAC'=>'N','AAT'=>'N','AAA'=>'K','AAG'=>'K','AGC'=>'S','AGT'=>'S','AGA'=>'R','AGG'=>'R','GTA'=>'V','GTC'=>'V','GTG'=>'V','GTT'=>'V','GCA'=>'A','GCC'=>'A','GCG'=>'A','GCT'=>'A','GAC'=>'D','GAT'=>'D','GAA'=>'E','GAG'=>'E','GGA'=>'G','GGC'=>'G','GGG'=>'G','GGT'=>'G');
if(exists $g{$codon})
{
return $g{$codon};
}
else
{
print STDERR "Bad codon \"$codon\"!!\n";
exit;
}
Can someone please explain?
My perl is rusty but anyway.
The first line creates a hash (which is perls version of a hash table). The variable is called g (a bad name BTW). The % sigil before g is used to indicate that it is a hash. Perl uses sigils to denote types. The hash is initialises using the double barrelled arrow syntax. 'TTT'=>'F' creates an entry TTT in the hash table with value F. The my is used to give the variable a local scope.
The next few lines are fairly self explanatory. It will check whether the hash contains an entry with key $codon. The $ sigil is used to indicate that it's a scalar value. If if exists, you get the value. Otherwise, it prints the message specified to the standard error.
Since you're new to Perl, you should read a little about Perl itself before you try to decrypt it's syntax on your own. (Perl values a good Huffman encoding, and is also somewhat encrypted. ;-)Start with the 'perldoc perlintro' command, and go from there. If you're using Ubunutu, for instance, this documentation can be installed via
$ sudo apt-get install perl-doc
but it is also available in this file: Perl Reference documentation
In addition to perlintro, some other suggested reading is perlsyn (syntax description), perldata (data structures), perlop (operators, including quotes), perlreftut (intro to references), and perlvar (predefined variables and their meanings), in roughly that order.
I learnt perl from these, and I still refer to them often.
Also, if your DNA script has POD documentation, then you can view that neatly by typing
$ perldoc <script-filename>
(of course, POD documentation is listed in the source, in a rougher form; read perlpod for more details on documentation fromat)
If you are new to Perl with an interest to understand more quickly, you might begin with this web collection learn.perl. A nice supplement is the online Perl documentation of perldoc. Good luck and have fun.
In this case it looks like the %g hash serves as both a way to identify whether a codon is within the set of valid condons (hash keys) and for some mapping to what type of codon it is (hash value).
Hashes serve as a way to link unique keys with a value, but they also serve as unique lists of keys. In some cases you may see keys added to a hash and set to undef. This is a good sign that the hash is being used to track unique values of some type.
The codon is being passed in to the function, upper cased and then a hash of codons is checked to see if there is codon of that value registered. If the codon exists the registered value for that codon is returned, otherwise an error is outputed and the program ends.
the my (%g) is creating a hash, which is a structure that allows you to quickly look up a value by giving a key for that value. So for instance 'TCA'=>'S' maps the value 'S' to 'TCA'. If you ask the g hash for the value held for 'TCA' you will get 'S' ($g{'TCA'} //will equal 'S' )

Perl autoincrement of string not working as before

I have some code where I am converting some data elements in a flat file. I save the old:new values to a hash which is written to a file at the end of processing. On subsequence execution, I reload into a hash so I can reuse previously converted values on additional data files. I also save the last conversion value so if I encounter an unconverted value, I can assign it a new converted value and add it to the hash.
I had used this code before (back in Feb) on six files with no issues. I have a variable that is set to ZCKL0 (last character is a zero) which is retrieved from a file holding the last used value. I apply the increment operator
...
$data{$olddata} = ++$dataseed;
...
and the resultant value in $dataseed is 1 instead of ZCKL1. The original starting seed value was ZAAA0.
What am I missing here?
Do you use the $dataseed variable in a numeric context in your code?
From perlop:
If you increment a variable that is
numeric, or that has ever been used in
a numeric context, you get a normal
increment. If, however, the variable
has been used in only string contexts
since it was set, and has a value that
is not the empty string and matches
the pattern /^[a-zA-Z][0-9]\z/ , the
increment is done as a string,
preserving each character within its
range.
As prevously mentioned, ++ on strings is "magic" in that it operates differently based on the content of the string and the context in which the string is used.
To illustrate the problem and assuming:
my $s='ZCL0';
then
print ++$s;
will print:
ZCL1
while
$s+=0; print ++$s;
prints
1
NB: In other popular programming languages, the ++ is legal for numeric values only.
Using non-intuitive, "magic" features of Perl is discouraged as they lead to confusing and possibly unsupportable code.
You can write this almost as succinctly without relying on the magic ++ behavior:
s/(\d+)$/ $1 + 1 /e
The e flag makes it an expression substitution.

hash collision and appending data

Assume I have two strings (or byte arrays) A and B which both have the same hash (with hash I mean things like MD5 or SHA1). If I concatenate another string behind it, will A+C and B+C have the same hash H' as well? What happens to C+A and C+B?
I tested it with MD5 and in all my tests, appending something to the end made the hash the same, but appending at the beginning did not.
Is this always true (for all inputs)?
Is this true for all (well-known) hash functions? If no, is there a (well-known) hash function, where A+C and B+C will not collide (and C+A and C+B do not either)?
(besides from MD5(x + reverse(x)) and other constructed stuff I mean)
Details depend on the hash function H, but generally they work as follows:
Consume a block of input X (say, 512 bits)
Break the input into smaller pieces (say, 32 bits) and update hash internal state based on the input
If there's more input, go to step 1
At the end, spit the internal state out as the hash value H(X)
So, if A and B collide i.e. H(A) = H(B), the hash will be in the same state after consuming them. Updating the state further with the same input C can make the resulting hash value identical. This explains why H(A+C) is sometimes H(B+C). But it depends how A's and B's sizes are aligned to input block size and how the hash breaks the input block internally.
C+A and C+B can be identical if C is a multiple of the hash block size but probably not otherwise.
This depends entirely on the hash function. Also, the probability that you have those collisions is really small.
The hash functions being discussed here are typically cryptographic (SHA1, MD5). These hash functions have an Avalanche effect -- the output will change drastically with a slight change in the input.
The prefix and suffix extension of C will effectively make a longer input.
So, adding anything to the front or rear of the input should change the effective hash outputs significantly.
I do not understand how you did the MD5 check, here is my test.
echo "abcd" | md5sum
70fbc1fdada604e61e8d72205089b5eb
echo "0abcd" | md5sum
f5ac8127b3b6b85cdc13f237c6005d80
echo "abcd0" | md5sum
4c8a24d096de5d26c77677860a3c50e3
Are you saying that you located two inputs which had the same MD5 hash and then appended something to the end or beginning of the input and found that adding at the end resulted in the same MD5 as that for the original input?
Please provide samples with your test results.