Comparing two email address lists anonymously

Comparing two email address lists anonymously - email

Given two lists:
Company A:
user1#example.com
user2#example.com
user3#example.com
user4#example.com
Company B:
user2#example.com
user4#example.com
user5#example.com
Is there a way to anonymously compare them to get the number of email addresses in common (i.e., 2) without either company knowing which addresses were the ones in common?
Background:
Let's say that company A and company B want to know what portion of their userbase is common. For simplicity, they are just going to base it on email address and not concern themselves with people who use multiple addresses or different address variations (user+misc#example.com).
For the sake of privacy, neither company can give the other the plain list of email addresses. If they used the same simple hash, e.g. MD5, each company could easily know which members were in common (not desired). If they used a hash salted with a company specific secret, the addresses wouldn't be comparable any longer so the question couldn't be answered.
Is there some trick using key encryption or some other mathematical way to accomplish what I'm looking to do?

I believe this question could be understood better in the realm of cryptography.
It is a problem of secure multi-party computation.
I'm not aware of any bullet proof solution for this problem but I can think of the following:
Choose a commutative hash function (H):
H(H(string, seed1), seed2) = H(H(string, seed2), seed1)
Each party (Company A and Company B) has to choose a secret seed:
SEED_A, SEED_B
Company A hashes all email addresses using SEED_A, Company B hashes all email addresses using SEED_B.
They interchange the hashes.
Each company applies the hash function again on the set received from the opposing party.
At this point the data should already be garbled and the companies should not be able to recognize their own email addresses (since they've been already hashed twice - the second time with an unknown key).
All the email addresses should be laid out openly and those that have the same hash should be counted as the email addresses that belong to both companies (except that neither company can tell the source of the hash).
This is the theory. Hopefully I didn't miss anything and there are no flaws in the algorithm.
As for the implementation, here's the most trivial PHP script that I could come with:
$a = array("user1#example.com", "user2#example.com", "user3#example.com", "user4#example.com");
$b = array("user2#example.com", "user4#example.com", "user5#example.com");
function enc($str, $seed) {
for ($i = strlen($str) - 1; $i >= 0; $i--) {
$str[$i] = $str[$i] ^ $seed[$i % strlen($seed)];
}
return $str;
}
/* Company A */
$hashesForB = array();
$SEED_A = 'SALT FOR COMPANY A';
foreach ($a as $address) {
$hashesForB[] = enc($address, $SEED_A);
}
/* Company B */
$hashesForA = array();
$SALT_B = 'THIS IS THE SALT FOR COMPANY B';
foreach ($b as $address) {
$hashesForA[] = enc($address, $SALT_B);
}
/* Company A */
$hashesForB_2 = array();
foreach ($hashesForA as $hash) {
$hashesForB_2[] = enc($hash, $SEED_A);
}
/* Company B */
$hashesForA_2 = array();
foreach ($hashesForB as $hash) {
$hashesForA_2[] = enc($hash, $SALT_B);
}
$common = count(array_intersect($hashesForA_2, $hashesForB_2));
print $common; // it will output 2
Click here for the DEMO
As you can see in the code above, I used the XOR algorithm for (pseudo) hashing (actually, any addition based hash function should do the job).
Obviously, this is not the best choice for many reasons:
XOR will return the original input upon a new call with the same salt
the entropy is not the best you could hope for
the data is not truncated
Still, you could implement your own hashing function using the suggestions here, here, here or here.

Is the privacy concern that privacy agreement prohibit sharing of email addresses? or is it a competitive concern?
If you just want to get an idea of percentage of overlap, then I'd think a simple encoding of the email addresses might work. For example, de-dupe each list, Base64 encode each email address, then run the comparison to get overlap, then report on the numbers.
A simple NDA could make this a less technical problem.

It depends the language you want to use.
In python, you could use this script :
listA = ('user1#example.com', 'user2#example.com', 'user3#example.com')
listB = ('user1#example.com', 'user2#example.com')
result = [x for x in listA if x in listB]
print(len(result))
For security, you could host this script in an external server where both companies just can put in their lists and then check the result.

Related

Get Matching SHA256 Algorithm - Perl

Hi I'm trying to generate a similar sha256 hex, but I can't seem to get a matching one. I want to generate just about any password using a random key.
In this case, I'm using test123 : ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae
Here is my code:
print "Final Hash: " . generateHash("ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae", "fx4;)#?%") . chr(10);
sub generateHash {
my ($strPass, $strLoginKey) = #_;
my $strHash = encryptPass(uc($strPass), $strLoginKey);
return $strHash;
}
sub encryptPass {
my ($strPassword, $strKey) = #_;
my $strSalt = 'Y(02.>\'H}t":E1';
my $strSwapped = swapSHA($strPassword);
print "First Swap: " . $strSwapped . chr(10);
my $strHash = sha256_hex($strSwapped . $strKey . $strSalt);
print "Hashed Into: " . $strHash . chr(10);
my $strSwappedHash = swapSHA($strHash) . chr(10);
print "Last Swapped: " . $strSwappedHash . chr(10);
return $strSwappedHash;
}
sub swapSHA {
my ($strHash) = #_;
my $strSwapped = substr($strHash, 32, 32);
$strSwapped .= substr($strHash, 0, 32);
return $strSwapped;
}
Any help would be greatly appreciated!
The output I get:
Original Hash: ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae
Hashed Into: 34b6bdd73b3943d7baebf7d0ff54934849a38ee09c387435727e2b88566b4b85
Last Swapped: 49a38ee09c387435727e2b88566b4b8534b6bdd73b3943d7baebf7d0ff549348
Final Hash: 34b6bdd73b3943d7baebf7d0ff54934849a38ee09c387435727e2b88566b4b85
I am trying to make the output have final value same as input
Final Hash: ecd71870d1963316a97e3ac3408c9835ad8cf0f3c1bc703527c30265534f75ae
and I want to do this by reversing the "Hashed Into" value.

SHA, as a hashing algorithm, is designed to prevent collisions. i.e. part of its power, and usefulness, is in limiting the strings which will hash to the same resultant value.
It sounds like you want to find a second string which will hash to the same hashed value as test123 hashes to. This kind of goes the intent of using SHA in the first place.
It is possible to brute force the values with SHA, i.e. given a hashed value, you can brute force the value that was hashed by computing hashes and comparing the hashed value to the target value. This will take some time. Other algorithms, such as bcrypt, are more difficult to brute force, but are more computationally expensive for you also.
Here is another post related to brute forcing SHA-512, which is effectively equivalent in algorithm to SHA-256. The linked post is Java as opposed to Perl, but the concepts are language agnostic. How long to brute force a salted SHA-512 hash? (salt provided)

You're badly misunderstanding what a hash is for. It's a ONE WAY street by design. It's also designed to have a very low probability of 'collision' - two source values that hash to the same result. And by 'very low' I mean 'for practical purposes, it doesn't'. A constrained string - such as a password - simply won't do it.
So what typically happens for passwords - my client takes my password, generates a hash, sends it to the server.
The server compares that against it's list - if the hash matches, we assume that my password was correct. This means at no point is my password sent 'in the clear' nor is possible to work out what it was by grabbing the hash.
To avoid duplicates showing up (e.g. two people with the same password) usually you'll hash some unique values. Simplistically - username + password, when hashed.
The purpose of authenticating against hashes, is to ensure the cleartext password is never required to be held anywhere - and that is all. You still need to secure you communication channel (to avoid replay attacks) and you still need to protect against brute force guessing of password.
But brute forcing hashes is by design an expensive thing to attempt. You will see places where 'rainbow tables' exist, where people have taken every valid password string, and hashed it, so they can rapidly crack retrieved hashes from the server. These are big, and took a long time to generate initially though, and are defeated at least partially by salting or embedding usernames into the hash.
But I cannot re-iterate strongly enough - don't ever hand roll your own security unless you're REALLY sure what's going on. You'll build in weaknesses that you didn't even know existed, and your only 'security' is that no one's bothered to look yet.

You should not do this. It is insecure and vulnerable to dictionary attacks.
The correct way to turn passwords into things you store, is to use a PBKDF like "bcrypt" (password-based-key-derivation-function).
Check out Digest::Bcrypt
Word of caution: if anyone ever tells you (or helps you) to use a "hash" for storing passwords, they do not know anything about security or cryptography. Smile at them, and ignore everything they say next.

Feasibility of extracting arbitrary locations from a given string? [closed]

This question is unlikely to help any future visitors; it is only relevant to a small geographic area, a specific moment in time, or an extraordinarily narrow situation that is not generally applicable to the worldwide audience of the internet. For help making this question more broadly applicable, visit the help center.
Closed 10 years ago.
I have many spreadsheets with travel information on them amongst other things.
I need to extract start and end locations where the row describes travel, and one or two more things from the row, but what those extra fields are shouldn't be important.
There is no known list of all locations and no fixed pattern of text, all that I can look for is location names.
The field I'm searching in has 0-2 locations, sometimes locations have aliases.
The Problem
If we have this:
00229 | 445 | RTF | Jan | trn_rtn_co | Chicago to Base1
00228 | 445 | RTF | Jan | train | Metroline to home coming from Base1
00228 | 445 | RTF | Jan | train_s | Standard train journey to Friends
I, for instance (though it will vary), will want this:
RTF|Jan|Chicago |Base1
RTF|Jan|Home |Base1
RTF|Jan|NULL |Friends
And then to go though, look up what Base1 and Friends mean for that person (whose unique ID is RTF) and replace them with sensible locations (assuming they only have one set of 'friends'):
RTF|Jan|Chicago |Rockford
RTF|Jan|Home |Rockword
RTF|Jan|NULL |Milwaukee
What I need
I need a way to pick out key words from the final column, such as: Metroline to home coming from Base1.
There are three types of words I'm looking for:
Home LocationsThese are known and limited, I can get these from a list
Home AliasesThese are known and limited, I can get these from a list
Away LocationsThese are unknown but cities/towns/etc in the UK I don't know how to recognize these in the string. This is my main problem
My Ideas
My go to program I thought of was awk, but I don't know if I can reliably search to find where a proper noun (i.e. location) is used for the location names.
Is there a package, library or dictionary of standard locations?
Can I get a program to scour the spreadsheets and 'learn' the names of locations?
This seems like a problem that would have been solved already (i.e. find words in a string of text), but I'm not certain what I'm doing, and I'm only a novice programmer.
Any help on what I can do would be appreciated.
Edit:
Any answer such as "US_Locations_Cities is something you could check against", "Check for strings mentioned in a file in awk using ...", "There is a library for language X that will let a program learn to recognise location names, it's not RegEx, but it might work", or "There is a dictionary of location names here" would be fine.
Ultimately anything that helps me do what I want to do (i.e get the location names!) would be excellent.

Sorry to tell you, but i think this is not 100% programmable.
The best bet would be to define some standard searches:
Chicago to Base1
[WORD] to [WORD]:
where "to" is fixed and you look for exactly one word before and after. the word before then is your source and word after your target
Metroline to home coming from Base1
[WORD] to [WORD] coming from [WORD]:
where "to" and "coming from" is fixed and you look for three words in the appropriate slots.
etc
if you can match a source and target -> ok
if you cannot match something then throw an error for that line and let the user decide or even better implement an appropiate correction and let the program automatically reevaluate that line.
these are non-trivial goals.
consider:
Cities out of us of a
Non english text entries
Abbreviations
for automatic error corrections try to match the found [WORD]'s with a list of us or other cities.
if the city is not found throw an error. if you find that error either include that not found city to your city list or translate a city name in a publicly known (official) name.

The best I can suggest is that, as long as your locations are all US cities, you can use a database of zip codes such as this one.
I don't know how you expect any program to pick up things like Friends or Base1

I have to agree with hacktick that as it stands now, it is not programmable. It seems that the only solution is to invent a language or protocol.
I think an easy implementation follows:
In this language you have two keywords: to and from (you could also possibly allocate at as a keyword synoym for from as well).
These keywords define a portion of string that follows as a "scan area" for
recognizing names
I'm only planning on implementing the simplest scan, but as indicated at the end of the post allows you to do your fallback.
In the implementation you have a "Preferred Name" hash, where you define the names that you want displayed for things that appear there.
{ Base1 => 'Rockford'
, Friends => 'Milwaukee'
, ...
}
You could split your sentences by chunks of text between the keywords, using the following rules:
A. First chunk, if not a keyword is taken as the value of 'from'.
A. On this or any subsequent chunk, if keyword then save the next chunk
after that for that value.
A. Each value is "scanned" for a preferred phrase before being stored
as the value.
my #chunks
= grep {; defined and ( s/^\s+//, s/\s+$//, length ) }
split /\b(from|to)\s+/i, $note
;
my %parts = ( to => '', from => '' );
my $key;
do {
last unless my $chunk = shift #chunks;
if ( $key ) {
$parts{ $key } = $preferred_title{ $chunk } // $chunk;
$key = '';
}
elsif ( exists $parts{ lc $chunk } ) {
$key = lc $chunk;
}
elsif ( !$parts{from} ) {
$parts{from} = $preferred_title{ $chunk } // $chunk;
}
} while ( #chunks );
say join( '|', $note, #parts{ qw<from to> } );
At the very least, collecting these values and printing them out can give you a sieve to decide on further courses of action. This will tell you that 'home coming' is perceived as a 'from' statement, as well as 'Standard train journey'.
You *could fix the 'home coming' by amending the regex thusly:
/\b(?:(?:coming )?(from)|(to))\s+/i
And we could add the following key-value pair to our preferred_title hash:
home => 'Home'
We could simply define 'Standard train journey' => '', or we could create a list of rejection patterns, where we reject a string as a meaningful value if they fit a pattern.
But they allow you to dump out a list of values and refine your scan of data. Another idea is that as it seems that your pretty consistent with your use of capitals (except for 'home') for places. So we could increase our odds of finding the right string by matching the chunk with
/\b(home|\p{Upper}.*)/
Note that this still considers 'Standard train journey' a proper location. So this would still need to be handled by rejection rules.
Here I reiterate that this can be a minimal approach to scanning the data to the point that you can make sense of what it this system takes to be locations and "80/20" it down: that is, hopefully those rules handle 80 percent of the cases, and you can tune the algorithm to handle 80 percent of the remaining 20, and iterate to the point that you simply have to change a handful of entries at worst.
Then, you have a specification that you would need to follow in creating travel notes from then on. You could even scan the notes as they were entered and alert something like
'No destination found in note!'.

How do you use SHA256 to create a token of key,value pairs and a secret signature?

I want to validate some hidden input fields (to make sure they arent changed on submission) with the help of a sha-encoded string of the key value pairs of these hidden fields. I saw examples of this online but I didnt understand how to encode and
decode the values with a dynamic secret value. Can someone help me understand how to do this in perl?
Also which signature type (MD5, SHA1, SHA256, etc), has a good balance of performance and security?
update
So, how do you decode the string once you get it encoded?

What you really need is not a plain hash function, but a message authentication code such as HMAC. Since you say you'd like to use SHA-256, you might like HMAC_SHA256, which is available in Perl via the Digest::SHA module:
use Digest::SHA qw(hmac_sha256_base64);
my $mac = hmac_sha256_base64( $string, $key );
Here, $key is an arbitrary key, which you should keep secret, and $string contains the data you want to sign. To apply this to a more complex data structure (such as a hash of key–value pairs), you first need to convert it to a string. There are several ways to do that; for example, you could use Storable:
use Storable qw(freeze);
sub pairs_to_string {
local $Storable::canonical = 1;
my %hash = #_;
return freeze( \%hash );
}
You could also URL-encoding, as suggested by David Schwartz. The important thing is that, whatever method you use, it should always return the exact same string when given the same hash as input.
Then, before sending the data to the user, you calculate a MAC for them and include it as an extra field in the data. When you receive the data back, you remove the MAC field (and save its value), recalculate the MAC for the remaining fields and compare it to the value you received. If they don't match, someone (or something) has tampered with the data. Like this:
my $key = "secret";
sub mac { hmac_sha256_base64( pairs_to_string(#_), $key ) }
# before sending data to client:
my %data = (foo => "something", bar => "whatever");
$data{mac} = mac( %data );
# after receiving %data back from client:
my $mac = delete $data{mac};
die "MAC mismatch" if $mac ne mac( %data );
Note that there are some potential tricks this technique doesn't automatically prevent, such as replay attacks: once you send the data and MAC to the user, they'll learn the MAC corresponding to the particular set of data, and could potentially replace the fields in a later form with values saved from an earlier form. To protect yourself against such attacks, you should include enough identifying information in the data protected by the MAC to ensure that you can detect any potentially harmful replays. Ideally, you'd want to include a unique ID in every form and check that no ID is ever submitted twice, but that may not always be practical. Failing that, it may be a good idea to include a user ID (so that a malicious user can't trick someone else into submitting their data) and a form ID (so that a user can't copy data from one form to another) and perhaps a timestamp and/or a session ID (so that you can reject old data) in the form (and in the MAC calculation).

I don't know what you mean by "unpack", but you can't get original string from the hash.
Let's understand the problem: you render some hidden fields and you want to make sure that they're submitted unchanged, right? Here's how you can ensure that.
Let's suppose you have two variables:
first: foo
second: bar
You can hash them together with a secret key:
secret_key = "ysEJbKTuJU6u"
source_string = secret_key + "first" + "foo" + "second" + "bar"
hash = MD5(source_string)
# => "1adfda97d28af6535ef7e8fcb921d3f0"
Now you can render your markup:
<input type="hidden" name="first" value="foo" />
<input type="hidden" name="second" value="bar" />
<input type="hidden" name="hash" value="1adfda97d28af6535ef7e8fcb921d3f0">
Upon form submission, you get values of first and second fields, concat them to your secret key in a similar manner and hash again.
If hashes are equal, your values haven't been changed.
Note: never render secret key to the client. And sort key/value pairs before hashing (to eliminate dependency on order).
( disclaimer: I am not a crypto person, so you may just stop reading now)
As for performance/security, even though MD5 was found to have a weakness, it's still pretty usable, IMHO. SHA1 has a theoretical weakness, although no successful attack has been made yet. There are no known weaknesses in SHA-256.

For this application, any of the encryption algorithms is fine. You can pack the values any way you want, so long as it's repeatable. One common method is to pack the fields into a string the same way you would encode them into a URL for a GET request (name=value).
To compute the hash, create a text secret that can be whatever you want. It should be at least 12 bytes long though. Compute the hash of the secret concatenated with the packed fields and append that onto the end.
So, say you picked MD5, a secret of JS90320ERHe2 and you have these fields:
first_name = Jack
last_name = Smith
other_field = 7=2
First, URL encode it:
first_name=Jack&last_name=Smith&other_field=7%3d=2
Then compute the MD5 hash of
JS90320ERHe2first_name=Jack&last_name=Smith&other_field=7%3d=2
Which is 6d0fa69703935efaa183be57f81d38ea. The final encoded field is:
first_name=Jack&last_name=Smith&other_field=7%3d=2&hash=6d0fa69703935efaa183be57f81d38ea
So that's what you pass to the user. To validate it, remove the hash from the end, compute the MD5 hash by concatenating what's left with the secret, and if the hashes match, the field hasn't been tampered with.
Nobody can compute their own valid MD5 because they don't know to prefix the string with.
Note that an adversary can re-use any old valid value set. They just can't create their own value set from scratch or modify an existing one and have it test valid. So make sure you include something in the information so you can verify that it is suitable for the purpose it has been used.

Generate unique 3 letter/number code and compare to existing ones in PHP/MySQL

I'm making a code generation script for UN/LOCODE system and the database has unique 3 letter/number codes in every country. So for example the database contains "EE TLL", EE being the country (Estonia) and TLL the unique code inside Estonia, "AR TLL" can also exist (the country code and the 3 letter/number code are stored separately). Codes are in capital letters.
The database is fairly big and already contains a huge number of locations, the user has also the possibility of entering the 3 letter/number him/herself (which will be checked against the database before submission automatically).
Finally neither 0 or 1 may be used (possible confusion with O and I).
What I'm searching for is the most efficient way to pick the next available code when none is provided.
What I've came up with:
I'd check with AAA till 999, but then for each code it would require a new query (slow?).
I could store all the 40000 possibilities in an array and subtract all the used codes that are already in the database... but that uses too much memory IMO (not sure what I'm talking about here actually, maybe 40000 isn't such a big number).
Generate a random code and hope it doesn't exist yet and see if it does, if it does start over again. That's just risk taking.
Is there some magic MySQL query/PHP script that can get me the next available code?

I will go with number 2, it is simple and 40000 is not a big number.
To make it more efficient, you can store a number representing each 3-letter code. The conversion should be trivial because you have a total of 34 (A-Z, 2-9) letters.

I would for option 1 (i.e. do a sequential search), adding a table that gives the last assigned code per country (i.e. such that AAA..code are all assigned already). When assigning a new code through sequential scan, that table gets updated; for user-assigned codes, it remains unmodified.
If you don't want to issue repeated queries, you can also write this scan as a stored routine.
To simplify iteration, it might be better to treat the three-letter codes as numbers (as Shawn Hsiao suggests), i.e. give a meaning to A-Z = 0..25, and 2..9 = 26..33. Then, XYZ is the number X*34^2+Y*34+Z == 23*1156+24*34+25 == 27429. This should be doable using standard MySQL functions, in particular using CONV.

I went with the 2nd option. I was also able to make a script that will try to match as close as possible the country name, for example for Tartu it will try to match T** then TA* and if possible TAR, if not it will try TAT as T is the next letter after R in Tartu.
The code is quite extensive, I'll just post the part that takes the first possible code:
$allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ23456789';
$length = strlen($allowed);
$codes = array();
// store all possibilities in a huge array
for($i=0;$i<$length;$i++)
for($j=0;$j<$length;$j++)
for($k=0;$k<$length;$k++)
$codes[] = substr($allowed, $i, 1).substr($allowed, $j, 1).substr($allowed, $k, 1);
$used = array();
$query = mysql_query("SELECT code FROM location WHERE country = '$country'");
while ($result = mysql_fetch_array($query))
$used[] = $result['code'];
$remaining = array_diff($codes, $used);
$code = $remaining[0];
Thanks for your opinion, this will be the key to transport codes all over the world :)

Hash Collision in fairly simple encrypt/decrypt code

I'm trying to add a small level of security to a site and encode some ids. The id's are already a concat of linked table rows, so storing the encryption in the db isn't very efficient. Therefore I need to encode & decode the string.
I found this great little function from myphpscripts, and I'm wondering what the chances are of collisions.
I really don't know much about these sorts of things. I'm assuming that the longer my key, the less collisions i'm going to have.
I could end up with more than 10 million unique concatenated ids, and want to be sure I'm not going to run into issues.
function encode($string,$key) {
$key = sha1($key);
$strLen = strlen($string);
$keyLen = strlen($key);
$j=0;
$hash='';
for ($i = 0; $i < $strLen; $i++) {
$ordStr = ord(substr($string,$i,1));
if ($j == $keyLen) { $j = 0; }
$ordKey = ord(substr($key,$j,1));
$j++;
$hash .= strrev(base_convert(dechex($ordStr + $ordKey),16,36));
}
return $hash;
}

I think you are a bit confused about this issue.
The problem of collisions only applies to mappings that are not 1-to-1, but "lossy", i.e. map several different inputs to one ouput (such as hashes).
What you linked to looks like an encryption/decryption routine (if it works correctly, which I didn't check). Encryption by definition means that there is a matching decryption, hence the mapping defined by the encryption cannot have collisions (as you could not decrypt in that case).
So your question, as posted, does not make sense.
That said, I would strongly suggest you do not use bandaids like encrypting IDs. Just store the IDs server-side and generate a session key to refer to them.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse