Hash Collision in fairly simple encrypt/decrypt code - hash

I'm trying to add a small level of security to a site by encoding some IDs. The IDs are already a concatenation of linked table rows, so storing the encrypted value in the db isn't very efficient. Therefore I need to encode and decode the string.
I found this great little function from myphpscripts, and I'm wondering what the chances are of collisions.
I really don't know much about these sorts of things. I'm assuming that the longer my key, the fewer collisions I'm going to have.
I could end up with more than 10 million unique concatenated IDs, and want to be sure I'm not going to run into issues.
function encode($string,$key) {
    $key = sha1($key);
    $strLen = strlen($string);
    $keyLen = strlen($key);
    $j=0;
    $hash='';
    for ($i = 0; $i < $strLen; $i++) {
        $ordStr = ord(substr($string,$i,1));
        if ($j == $keyLen) { $j = 0; }
        $ordKey = ord(substr($key,$j,1));
        $j++;
        $hash .= strrev(base_convert(dechex($ordStr + $ordKey),16,36));
    }
    return $hash;
}

I think you are a bit confused about this issue.
The problem of collisions only applies to mappings that are not 1-to-1, but "lossy", i.e. map several different inputs to one output (such as hashes).
What you linked to looks like an encryption/decryption routine (if it works correctly, which I didn't check). Encryption by definition means that there is a matching decryption, hence the mapping defined by the encryption cannot have collisions (as you could not decrypt in that case).
So your question, as posted, does not make sense.
That said, I would strongly suggest you do not use bandaids like encrypting IDs. Just store the IDs server-side and generate a session key to refer to them.
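To make the "no collisions" point concrete, here is a rough Python sketch that is intended to mirror the PHP encode function above and adds a matching decode. It is an illustration, not the original author's code: the names `_key_stream` and `decode` are made up here, and the input is assumed to be plain ASCII/Latin-1 text. Each input byte plus a key byte (a hex digit of the SHA-1 of the key, so 48 to 102) always lands between 48 and 357, which is exactly two base-36 digits, so every character can be recovered.

```python
import hashlib

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"  # base-36 digit set


def _key_stream(key: str, length: int):
    """Repeat the character codes of sha1(key) in hex, like the PHP loop does."""
    hex_key = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return [ord(hex_key[i % len(hex_key)]) for i in range(length)]


def encode(text: str, key: str) -> str:
    data = text.encode("latin-1")  # assumption: IDs are single-byte characters
    out = []
    for byte, k in zip(data, _key_stream(key, len(data))):
        high, low = divmod(byte + k, 36)      # sum is 48..357: two base-36 digits
        out.append(DIGITS[low] + DIGITS[high])  # reversed pair, like strrev()
    return "".join(out)


def decode(encoded: str, key: str) -> str:
    pairs = [encoded[i:i + 2] for i in range(0, len(encoded), 2)]
    out = bytearray()
    for pair, k in zip(pairs, _key_stream(key, len(pairs))):
        total = DIGITS.index(pair[1]) * 36 + DIGITS.index(pair[0])
        out.append(total - k)
    return out.decode("latin-1")
```

Because decode(encode(x, key), key) gives back x for every x, two different IDs cannot encode to the same string under the same key. That says nothing about how hard the scheme is to break, only that it is reversible.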

Related

Get Matching SHA256 Algorithm - Perl

Hi I'm trying to generate a similar sha256 hex, but I can't seem to get a matching one. I want to generate just about any password using a random key.
In this case, I'm using test123 : ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae
Here is my code:
print "Final Hash: " . generateHash("ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae", "fx4;)#?%") . chr(10);
sub generateHash {
my ($strPass, $strLoginKey) = @_;
my $strHash = encryptPass(uc($strPass), $strLoginKey);
return $strHash;
}
sub encryptPass {
my ($strPassword, $strKey) = @_;
my $strSalt = 'Y(02.>\'H}t":E1';
my $strSwapped = swapSHA($strPassword);
print "First Swap: " . $strSwapped . chr(10);
my $strHash = sha256_hex($strSwapped . $strKey . $strSalt);
print "Hashed Into: " . $strHash . chr(10);
my $strSwappedHash = swapSHA($strHash) . chr(10);
print "Last Swapped: " . $strSwappedHash . chr(10);
return $strSwappedHash;
}
sub swapSHA {
my ($strHash) = @_;
my $strSwapped = substr($strHash, 32, 32);
$strSwapped .= substr($strHash, 0, 32);
return $strSwapped;
}
Any help would be greatly appreciated!
The output I get:
Original Hash: ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae
Hashed Into: 34b6bdd73b3943d7baebf7d0ff54934849a38ee09c387435727e2b88566b4b85
Last Swapped: 49a38ee09c387435727e2b88566b4b8534b6bdd73b3943d7baebf7d0ff549348
Final Hash: 34b6bdd73b3943d7baebf7d0ff54934849a38ee09c387435727e2b88566b4b85
I am trying to make the output have final value same as input
Final Hash: ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae
and I want to do this by reversing the "Hashed Into" value.
SHA, as a hashing algorithm, is designed to prevent collisions. i.e. part of its power, and usefulness, is in limiting the strings which will hash to the same resultant value.
It sounds like you want to find a second string which will hash to the same hashed value as test123 hashes to. This goes against the intent of using SHA in the first place.
It is possible to brute force the values with SHA, i.e. given a hashed value, you can brute force the value that was hashed by computing hashes and comparing the hashed value to the target value. This will take some time. Other algorithms, such as bcrypt, are more difficult to brute force, but are more computationally expensive for you also.
Here is another post related to brute forcing SHA-512, which is effectively equivalent in algorithm to SHA-256. The linked post is Java as opposed to Perl, but the concepts are language agnostic: "How long to brute force a salted SHA-512 hash? (salt provided)".
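As a rough illustration of what "brute forcing" means here, the sketch below never reverses the hash. It hashes each guess and compares the result to the target. The function name and the candidate list are made up for this example.

```python
import hashlib


def find_preimage(target_hex, candidates):
    """Return the first candidate whose SHA-256 hex digest equals target_hex."""
    for candidate in candidates:
        digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
        if digest == target_hex:
            return candidate
    return None  # none of the guesses matched


# Usage: the target is computed here only so the example is self-contained.
target = hashlib.sha256(b"test123").hexdigest()
guess = find_preimage(target, ["password", "letmein", "test123"])
```

The cost grows with the number of candidates, which is why this only works against short or predictable inputs.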
You're badly misunderstanding what a hash is for. It's a ONE WAY street by design. It's also designed to have a very low probability of 'collision' - two source values that hash to the same result. And by 'very low' I mean 'for practical purposes, it doesn't'. A constrained string - such as a password - simply won't do it.
So what typically happens for passwords - my client takes my password, generates a hash, sends it to the server.
The server compares that against its list - if the hash matches, we assume that my password was correct. This means at no point is my password sent 'in the clear' nor is it possible to work out what it was by grabbing the hash.
To avoid duplicates showing up (e.g. two people with the same password) usually you'll hash some unique values. Simplistically - username + password, when hashed.
The purpose of authenticating against hashes is to ensure the cleartext password is never required to be held anywhere - and that is all. You still need to secure your communication channel (to avoid replay attacks) and you still need to protect against brute force guessing of passwords.
But brute forcing hashes is by design an expensive thing to attempt. You will see places where 'rainbow tables' exist, where people have taken every valid password string, and hashed it, so they can rapidly crack retrieved hashes from the server. These are big, and took a long time to generate initially though, and are defeated at least partially by salting or embedding usernames into the hash.
But I cannot re-iterate strongly enough - don't ever hand roll your own security unless you're REALLY sure what's going on. You'll build in weaknesses that you didn't even know existed, and your only 'security' is that no one's bothered to look yet.
You should not do this. It is insecure and vulnerable to dictionary attacks.
The correct way to turn passwords into something you can store is to use a password-based key derivation function (PBKDF) such as "bcrypt".
Check out Digest::Bcrypt
Word of caution: if anyone ever tells you (or helps you) to use a "hash" for storing passwords, they do not know anything about security or cryptography. Smile at them, and ignore everything they say next.
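For readers who want to see the shape of this, here is a minimal Python sketch. Note that it swaps in PBKDF2-HMAC-SHA256 from the standard library rather than bcrypt, since bcrypt needs a third-party package. The iteration count and the function names are assumptions chosen for illustration, not recommendations.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # illustrative work factor (assumption); tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Derive a slow, salted digest; store both the salt and the digest."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def check_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

The per-password salt defeats precomputed tables, and the iteration count makes each guess expensive for an attacker.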

Comparing two email address lists anonymously

Given two lists:
Company A:
user1@example.com
user2@example.com
user3@example.com
user4@example.com
Company B:
user2@example.com
user4@example.com
user5@example.com
Is there a way to anonymously compare them to get the number of email addresses in common (i.e., 2) without either company knowing which addresses were the ones in common?
Background:
Let's say that company A and company B want to know what portion of their userbase is common. For simplicity, they are just going to base it on email address and not concern themselves with people who use multiple addresses or different address variations (user+misc@example.com).
For the sake of privacy, neither company can give the other the plain list of email addresses. If they used the same simple hash, e.g. MD5, each company could easily know which members were in common (not desired). If they used a hash salted with a company specific secret, the addresses wouldn't be comparable any longer so the question couldn't be answered.
Is there some trick using key encryption or some other mathematical way to accomplish what I'm looking to do?
I believe this question could be understood better in the realm of cryptography.
It is a problem of secure multi-party computation.
I'm not aware of any bullet proof solution for this problem but I can think of the following:
Choose a commutative hash function (H):
H(H(string, seed1), seed2) = H(H(string, seed2), seed1)
Each party (Company A and Company B) has to choose a secret seed:
SEED_A, SEED_B
Company A hashes all email addresses using SEED_A, Company B hashes all email addresses using SEED_B.
They interchange the hashes.
Each company applies the hash function again on the set received from the opposing party.
At this point the data should already be garbled and the companies should not be able to recognize their own email addresses (since they've been already hashed twice - the second time with an unknown key).
All the email addresses should be laid out openly and those that have the same hash should be counted as the email addresses that belong to both companies (except that neither company can tell the source of the hash).
This is the theory. Hopefully I didn't miss anything and there are no flaws in the algorithm.
As for the implementation, here's the most trivial PHP script that I could come with:
$a = array("user1@example.com", "user2@example.com", "user3@example.com", "user4@example.com");
$b = array("user2@example.com", "user4@example.com", "user5@example.com");
function enc($str, $seed) {
for ($i = strlen($str) - 1; $i >= 0; $i--) {
$str[$i] = $str[$i] ^ $seed[$i % strlen($seed)];
}
return $str;
}
/* Company A */
$hashesForB = array();
$SEED_A = 'SALT FOR COMPANY A';
foreach ($a as $address) {
$hashesForB[] = enc($address, $SEED_A);
}
/* Company B */
$hashesForA = array();
$SALT_B = 'THIS IS THE SALT FOR COMPANY B';
foreach ($b as $address) {
$hashesForA[] = enc($address, $SALT_B);
}
/* Company A */
$hashesForB_2 = array();
foreach ($hashesForA as $hash) {
$hashesForB_2[] = enc($hash, $SEED_A);
}
/* Company B */
$hashesForA_2 = array();
foreach ($hashesForB as $hash) {
$hashesForA_2[] = enc($hash, $SALT_B);
}
$common = count(array_intersect($hashesForA_2, $hashesForB_2));
print $common; // it will output 2
As you can see in the code above, I used the XOR algorithm for (pseudo) hashing (actually, any addition based hash function should do the job).
Obviously, this is not the best choice for many reasons:
XOR will return the original input upon a new call with the same salt
the entropy is not the best you could hope for
the data is not truncated
Still, you could implement your own hashing function using the suggestions here, here, here or here.
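One function that is commutative in the sense required above, and does not undo itself the way XOR does, is modular exponentiation: (h^a)^b and (h^b)^a are equal modulo a prime. The Python sketch below runs both companies' steps in a single process purely to show the arithmetic. It is an assumption about how the idea could be realised, not part of the answer above: the prime is far too small for real use, and the names `to_group`, `blind`, `new_secret` and `common_count` are invented for this example.

```python
import hashlib
import secrets

# Toy modulus (a Mersenne prime). Assumption for illustration only:
# a real deployment would need a much larger, carefully chosen group.
P = 2**127 - 1


def to_group(email: str) -> int:
    """Hash an address to a non-zero number below P."""
    digest = hashlib.sha256(email.lower().encode("utf-8")).digest()
    return (int.from_bytes(digest, "big") % P) or 1


def blind(value: int, secret: int) -> int:
    """Commutative step: blind(blind(v, a), b) == blind(blind(v, b), a)."""
    return pow(value, secret, P)


def new_secret() -> int:
    """Each company draws its own private exponent."""
    return secrets.randbelow(P - 3) + 2


def common_count(list_a, list_b, secret_a, secret_b) -> int:
    once_a = [blind(to_group(e), secret_a) for e in list_a]  # done by company A
    once_b = [blind(to_group(e), secret_b) for e in list_b]  # done by company B
    twice_a = {blind(v, secret_b) for v in once_a}  # B re-blinds A's values
    twice_b = {blind(v, secret_a) for v in once_b}  # A re-blinds B's values
    return len(twice_a & twice_b)
```

To hide which addresses matched, the lists should be shuffled before they are exchanged, since order alone would otherwise reveal the mapping.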
Is the privacy concern that a privacy agreement prohibits sharing of email addresses? Or is it a competitive concern?
If you just want to get an idea of percentage of overlap, then I'd think a simple encoding of the email addresses might work. For example, de-dupe each list, Base64 encode each email address, then run the comparison to get overlap, then report on the numbers.
A simple NDA could make this a less technical problem.
It depends on the language you want to use.
In python, you could use this script :
listA = ('user1@example.com', 'user2@example.com', 'user3@example.com')
listB = ('user1@example.com', 'user2@example.com')
result = [x for x in listA if x in listB]
print(len(result))
For security, you could host this script in an external server where both companies just can put in their lists and then check the result.

How do you use SHA256 to create a token of key,value pairs and a secret signature?

I want to validate some hidden input fields (to make sure they aren't changed on submission) with the help of a SHA-encoded string of the key-value pairs of these hidden fields. I saw examples of this online but I didn't understand how to encode and decode the values with a dynamic secret value. Can someone help me understand how to do this in Perl?
Also which signature type (MD5, SHA1, SHA256, etc), has a good balance of performance and security?
update
So, how do you decode the string once you get it encoded?
What you really need is not a plain hash function, but a message authentication code such as HMAC. Since you say you'd like to use SHA-256, you might like HMAC_SHA256, which is available in Perl via the Digest::SHA module:
use Digest::SHA qw(hmac_sha256_base64);
my $mac = hmac_sha256_base64( $string, $key );
Here, $key is an arbitrary key, which you should keep secret, and $string contains the data you want to sign. To apply this to a more complex data structure (such as a hash of key–value pairs), you first need to convert it to a string. There are several ways to do that; for example, you could use Storable:
use Storable qw(freeze);
sub pairs_to_string {
local $Storable::canonical = 1;
my %hash = @_;
return freeze( \%hash );
}
You could also use URL encoding, as suggested by David Schwartz. The important thing is that, whatever method you use, it should always return the exact same string when given the same hash as input.
Then, before sending the data to the user, you calculate a MAC for them and include it as an extra field in the data. When you receive the data back, you remove the MAC field (and save its value), recalculate the MAC for the remaining fields and compare it to the value you received. If they don't match, someone (or something) has tampered with the data. Like this:
my $key = "secret";
sub mac { hmac_sha256_base64( pairs_to_string(@_), $key ) }
# before sending data to client:
my %data = (foo => "something", bar => "whatever");
$data{mac} = mac( %data );
# after receiving %data back from client:
my $mac = delete $data{mac};
die "MAC mismatch" if $mac ne mac( %data );
Note that there are some potential tricks this technique doesn't automatically prevent, such as replay attacks: once you send the data and MAC to the user, they'll learn the MAC corresponding to the particular set of data, and could potentially replace the fields in a later form with values saved from an earlier form. To protect yourself against such attacks, you should include enough identifying information in the data protected by the MAC to ensure that you can detect any potentially harmful replays. Ideally, you'd want to include a unique ID in every form and check that no ID is ever submitted twice, but that may not always be practical. Failing that, it may be a good idea to include a user ID (so that a malicious user can't trick someone else into submitting their data) and a form ID (so that a user can't copy data from one form to another) and perhaps a timestamp and/or a session ID (so that you can reject old data) in the form (and in the MAC calculation).
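For comparison, the same sign-then-verify flow can be sketched in Python with the standard hmac module. This is an illustration under stated assumptions rather than a port of the Perl above: the key value and the function names are placeholders, and the fields are serialised by sorting them and URL-encoding so that the same data always yields the same string.

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET_KEY = b"replace-with-a-random-secret"  # placeholder; never sent to the client


def sign_fields(fields: dict) -> str:
    """Return a hex HMAC-SHA256 over a canonical encoding of the fields."""
    canonical = urlencode(sorted(fields.items()))  # sorted: order-independent
    return hmac.new(SECRET_KEY, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_fields(fields: dict, mac: str) -> bool:
    """Recompute the MAC for the received fields and compare in constant time."""
    expected = sign_fields(fields).encode("ascii")
    return hmac.compare_digest(expected, mac.encode("utf-8"))


# Before sending: compute the MAC and put it in an extra hidden field.
data = {"foo": "something", "bar": "whatever"}
mac = sign_fields(data)
# After receiving: strip the MAC field, then verify the remaining fields.
ok = verify_fields(data, mac)
```

There is nothing to "decode" on the way back: the server simply recomputes the MAC and checks that it matches.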
I don't know what you mean by "unpack", but you can't get original string from the hash.
Let's understand the problem: you render some hidden fields and you want to make sure that they're submitted unchanged, right? Here's how you can ensure that.
Let's suppose you have two variables:
first: foo
second: bar
You can hash them together with a secret key:
secret_key = "ysEJbKTuJU6u"
source_string = secret_key + "first" + "foo" + "second" + "bar"
hash = MD5(source_string)
# => "1adfda97d28af6535ef7e8fcb921d3f0"
Now you can render your markup:
<input type="hidden" name="first" value="foo" />
<input type="hidden" name="second" value="bar" />
<input type="hidden" name="hash" value="1adfda97d28af6535ef7e8fcb921d3f0">
Upon form submission, you get values of first and second fields, concat them to your secret key in a similar manner and hash again.
If hashes are equal, your values haven't been changed.
Note: never render secret key to the client. And sort key/value pairs before hashing (to eliminate dependency on order).
( disclaimer: I am not a crypto person, so you may just stop reading now)
As for performance/security, even though MD5 was found to have a weakness, it's still pretty usable, IMHO. SHA1 has a theoretical weakness, although no successful attack has been made yet. There are no known weaknesses in SHA-256.
For this application, any of these hash algorithms is fine. You can pack the values any way you want, so long as it's repeatable. One common method is to pack the fields into a string the same way you would encode them into a URL for a GET request (name=value).
To compute the hash, create a text secret that can be whatever you want. It should be at least 12 bytes long though. Compute the hash of the secret concatenated with the packed fields and append that onto the end.
So, say you picked MD5, a secret of JS90320ERHe2 and you have these fields:
first_name = Jack
last_name = Smith
other_field = 7=2
First, URL encode it:
first_name=Jack&last_name=Smith&other_field=7%3d=2
Then compute the MD5 hash of
JS90320ERHe2first_name=Jack&last_name=Smith&other_field=7%3d=2
Which is 6d0fa69703935efaa183be57f81d38ea. The final encoded field is:
first_name=Jack&last_name=Smith&other_field=7%3d=2&hash=6d0fa69703935efaa183be57f81d38ea
So that's what you pass to the user. To validate it, remove the hash from the end, compute the MD5 hash by concatenating what's left with the secret, and if the hashes match, the field hasn't been tampered with.
Nobody can compute their own valid MD5 because they don't know what to prefix the string with.
Note that an adversary can re-use any old valid value set. They just can't create their own value set from scratch or modify an existing one and have it test valid. So make sure you include something in the information so you can verify that it is suitable for the purpose it has been used.

Key value store in Perl

I hope this question is still on topic, but recently I found a key-value store programmed in Perl. It was pretty simple, RAM based and I think it had just set and get and also an 'expire' option for keys. I also think it came as both an XS and a pure-Perl version.
I have been searching for quite a while now and I am not sure whether it is on CPAN or whether I saw it on GitHub. Maybe someone knows what I am talking about.
It might be helpful in narrowing things down if you could explain what exactly the module does that is special in that regard. If you're looking to implement something with caching in general, I'd point you towards CHI, which is basically a common API with multiple caching drivers.
Do you mean Cache? It can store key/value pairs in a number of places, including shared memory.
It sounds like you are describing Memcached. There is a Perl interface on CPAN.
I've used Tie::Cache for this in the past with excellent results. It created a tied hash variable that exhibits LRU behavior when it grows beyond a configured maximum key count.
use Tie::Cache;
my $cache_size = 1000;
use vars qw(%cache);
%cache = ();
tie %cache, 'Tie::Cache', $cache_size;
From here, you can store key/value pairs (of course, the value side can be a reference) in %cache and should its size grow to 1000 keys, the LRU keys will be deleted as more are added.
In my usage, I store the right-hand side as an arrayref holding the cached value along with a timestamp of when the entry was cached; my cache reference code checks the timestamp and deletes the key without using it if the entry isn't fresh enough:
sub getCacheMatch {
    my $check_value = shift;
    my $timeout = 600; # 10 minutes
    # Check cache for a match.
    my ($result, $time_cached);
    my $now = time();
    my $cache_entry = $cache{$check_value};
    if ($cache_entry) {
        ($result, $time_cached) = @{$cache_entry};
        if ($now - $time_cached > $timeout) {
            # Entry is stale: drop it and report a miss.
            delete $cache{$check_value};
            return undef;
        } else {
            return $result;
        }
    }
    return undef; # nothing cached for this key
}
And I update the cache elsewhere in the code like so:
$cache{$cache_checkstring} = [$value_to_cache, $now];
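The same idea (a size-capped LRU store whose entries also expire) can be sketched in a few lines of Python. This is not Tie::Cache or any CPAN module, just an illustration of the set/get/expire behaviour described in the question. The class name and the injectable `clock` parameter are assumptions made for this example.

```python
import time
from collections import OrderedDict


class ExpiringLRUCache:
    """In-memory key/value store with a key-count cap and a per-entry timeout."""

    def __init__(self, max_keys=1000, timeout=600.0, clock=time.monotonic):
        self._data = OrderedDict()  # oldest (least recently used) entries first
        self._max_keys = max_keys
        self._timeout = timeout
        self._clock = clock

    def set(self, key, value):
        self._data[key] = (value, self._clock())
        self._data.move_to_end(key)
        while len(self._data) > self._max_keys:
            self._data.popitem(last=False)  # evict the least recently used key

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, cached_at = entry
        if self._clock() - cached_at > self._timeout:
            del self._data[key]  # stale entry: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value
```

Passing the clock in makes the expiry logic easy to exercise without waiting ten minutes.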

Generate unique 3 letter/number code and compare to existing ones in PHP/MySQL

I'm making a code generation script for UN/LOCODE system and the database has unique 3 letter/number codes in every country. So for example the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia, "AR TLL" can also exist (the country code and the 3 letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations, the user has also the possibility of entering the 3 letter/number him/herself (which will be checked against the database before submission automatically).
Finally, neither 0 nor 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've come up with:
1. I'd check from AAA to 999, but then each code would require a new query (slow?).
2. I could store all the 40000 possibilities in an array and subtract all the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually, maybe 40000 isn't such a big number).
3. Generate a random code and check whether it exists; if it does, start over again. That's just risk taking.
Is there some magic MySQL query/PHP script that can get me the next available code?
I will go with number 2, it is simple and 40000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 (A-Z, 2-9) letters.
I would go for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.
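The mapping described above (A-Z = 0..25, 2-9 = 26..33) can be sketched as a pair of conversion helpers. This is shown in Python for brevity rather than in MySQL or PHP, and the function names are made up for illustration.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789"  # A-Z = 0..25, 2-9 = 26..33
BASE = len(ALPHABET)  # 34


def code_to_number(code: str) -> int:
    """Read a code such as 'XYZ' as a base-34 number."""
    number = 0
    for ch in code:
        number = number * BASE + ALPHABET.index(ch)
    return number


def number_to_code(number: int, width: int = 3) -> str:
    """Inverse of code_to_number, left-padded with 'A' to the given width."""
    chars = []
    for _ in range(width):
        number, digit = divmod(number, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))
```

With this, "the code after TLL" is just number_to_code(code_to_number("TLL") + 1), and a per-country "last assigned" value can be stored as a plain integer.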
I went with the 2nd option. I was also able to make a script that will try to match the location name as closely as possible. For example, for Tartu it will try to match T** then TA* and, if possible, TAR; if not, it will try TAT, as T is the next letter after R in Tartu.
The code is quite extensive, I'll just post the part that takes the first possible code:
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();
// store all possibilities in a huge array
for($i=0;$i<$length;$i++)
    for($j=0;$j<$length;$j++)
        for($k=0;$k<$length;$k++)
            $codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);
$used = array();
// NOTE: $country must be escaped (or bound as a parameter) before it reaches the query
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
    $used[] = $result['code'];
// array_diff() preserves keys, so index 0 only exists if AAA is unused;
// reset() returns the first remaining code (or false if none are left)
$remaining = array_diff($codes, $used);
$code = reset($remaining);
Thanks for your opinion, this will be the key to transport codes all over the world :)
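For reference, the "first free code" lookup does not need the full array of possibilities in memory. A rough Python sketch (the function name is invented, and the used codes are assumed to have been fetched from the database already) walks the codes lazily in order and stops at the first gap:

```python
from itertools import product

ALLOWED = "ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789"  # no 0 or 1


def next_free_code(used_codes):
    """Return the first 3-character code, in ALLOWED order, that is not used."""
    used = set(used_codes)  # set membership tests are O(1)
    for letters in product(ALLOWED, repeat=3):  # yields AAA, AAB, ... lazily
        code = "".join(letters)
        if code not in used:
            return code
    return None  # every code for this country is taken
```

Only the used codes for one country are held in memory, and the loop ends as soon as a free code is found.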