PostgreSQL: Match Email Addresses With or Without Subdomains

Scenario
For most of its history, my company used subdomains in its email addresses, mostly by state, though some were division subdomains. A few examples of what we had:
mo.widgits.com
sd.widgits.com
va.widgits.com
nhq.widgits.com
gis.widgits.com
tech.widgits.com
...and so on.
New Paradigm
A few years ago, top management decided that they wanted us all to be one happy family; as part of this cultural realignment, they changed everyone's email addresses to a single domain, in the format firstname.lastname@widgits.com.
Present Challenges
In many of our corporate databases, we find a mixture of records using either the old format or the new format. For example, the same individual might have porky.pig@widgits.com in the employee system and porky.pig@in.widgits.com in the training system. I need to match individuals across the various systems regardless of which email format is used for them in each system.
Desired Matches
porky.pig@in.widgits.com = porky.pig@widgits.com -> true
mary.poppins@widgits.com = mary.poppins@nhq.widgits.com -> true
bob.baker@widgits.com = bob.barker@gis.widgits.com -> false
How to Accomplish This?
Is there a regex pattern that I can use to match email addresses regardless of which format they are in? Or will I need to manually strip out the subdomain before attempting to match them?

Off the top of my head, you could strip off the subdomain from all email addresses before comparing them (that is, compare only the email name and domain). Something like this:
SELECT *
FROM emails
WHERE REGEXP_REPLACE(email1, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2') =
      REGEXP_REPLACE(email2, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2');
Demo
Data and query:
WITH emails AS (
    SELECT 'porky.pig@in.widgits.com' AS email1, 'porky.pig@widgits.com' AS email2 UNION ALL
    SELECT 'mary.poppins@widgits.com', 'mary.poppins@nhq.widgits.com' UNION ALL
    SELECT 'bob.baker@widgits.com', 'bob.barker@gis.widgits.com'
)
SELECT *
FROM emails
WHERE REGEXP_REPLACE(email1, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2') =
      REGEXP_REPLACE(email2, '^(.*@).*?([^.]+\.[^.]+)$', '\1\2');
Here is an explanation of the regex pattern used:
^                  start of the email
(.*@)              capture the local part, including the @, in \1
.*?                lazily consume any subdomain labels, up to but not including
([^.]+\.[^.]+)     the final domain only (e.g. google.com), captured in \2
$                  end of the email
Then we replace with \1\2 to effectively remove any subdomain components.
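As a quick sanity check of the stripping behaviour, you can run the replacement against a single literal (one of the sample addresses above):
SELECT REGEXP_REPLACE('porky.pig@in.widgits.com',
                      '^(.*@).*?([^.]+\.[^.]+)$', '\1\2') AS stripped;
-- returns porky.pig@widgits.com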

How about something like this?
SELECT
    *
FROM
    (
        SELECT
            table1.email AS table1_email,
            table2.email AS table2_email,
            SPLIT_PART(table1.email, '@', 1) AS table1_username,
            SPLIT_PART(table2.email, '@', 1) AS table2_username,
            SPLIT_PART(table1.email, '@', 2) AS table1_domain,
            SPLIT_PART(table2.email, '@', 2) AS table2_domain
        FROM
            table1
            CROSS JOIN table2
    ) s
WHERE
    table1_username = table2_username
    AND (
        table1_domain = table2_domain
        OR table1_domain LIKE '%.' || table2_domain
        OR table2_domain LIKE '%.' || table1_domain
    );
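One caveat worth spelling out (and the reason the equality test appears above): the LIKE suffix checks only fire when one domain is a strict subdomain of the other, so two addresses that both use the bare widgits.com domain would otherwise not match. A quick check with inlined literals:
SELECT 'in.widgits.com'  LIKE '%.' || 'widgits.com' AS old_vs_new,   -- true
       'nhq.widgits.com' LIKE '%.' || 'widgits.com' AS new_vs_old,   -- true
       'widgits.com'     LIKE '%.' || 'widgits.com' AS same_domain;  -- false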

Related

python 3.7 and ldap3 reading group membership

I am using Python 3.7 and ldap3. I can make a connection and retrieve a list of the groups in which I am interested. I am having trouble getting group members though.
server = Server('ldaps.ad.company.com', use_ssl=True, get_info=ALL)
with Connection(server, 'mydomain\\ldapUser', '******', auto_bind=True) as conn:
    base = "OU=AccountGroups,OU=UsersAndGroups,OU=WidgetDepartment," \
           + "OU=LocalLocation,DC=ad,DC=company,DC=com"
    criteria = """(
        &(objectClass=group)
        (
            |(sAMAccountName=grp-*widgets*)
            (sAMAccountName=grp-oldWidgets)
        )
    )"""
    attributes = ['sAMAccountName', 'distinguishedName']
    conn.search(base, criteria, attributes=attributes)
    groups = conn.entries
At this point groups contains all the groups I want. I want to iterate over the groups to collect the members.
for group in groups:
    # print(cn)
    criteria = f"""
    (&
        (objectClass=person)
        (memberof:1.2.840.113556.1.4.1941:={group.distinguishedName})
    )
    """
    # criteria = f"""
    # (&
    #     (objectClass=person)
    #     (memberof={group.distinguishedName})
    # )
    # """
    attributes = ['displayName', 'sAMAccountName', 'mail']
    conn.search(base, criteria, attributes=attributes)
    people = conn.entries
I know there are people in the groups, but people is always an empty list. It doesn't matter whether I do a recursive search or not.
What am I missing?
Edit
There is a longer backstory to this question that is too long to go into. I have a theory about this particular issue, though. I was running out of time and switched to a different Python LDAP library, which is working. I think the issue in this question might be that I "formatted" the query over multiple lines. The new LDAP library (python-ldap) complained, I stripped out the newlines, and it just worked. I have not had time to go back and test that theory with ldap3.
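If that theory is right, a possible workaround for ldap3 (untested, just a sketch reusing the names from the loop above; the helper name is made up) would be to collapse the filter to a single line before handing it to conn.search:
def flatten_ldap_filter(filter_text):
    """Join a multi-line LDAP filter into one line, dropping the indentation."""
    return "".join(line.strip() for line in filter_text.splitlines())

criteria = flatten_ldap_filter(f"""
    (&
        (objectClass=person)
        (memberof:1.2.840.113556.1.4.1941:={group.distinguishedName})
    )
""")
conn.search(base, criteria, attributes=attributes)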
people is overwritten in each iteration of your loop over groups.
Maybe the search result for the last group entry in groups is just empty.
You should initialise an empty list outside of your loop and extend it with your results:
people = []
for group in groups:
    ...
    conn.search(...)
    people.extend(conn.entries)
Another note about your code snippet above: when combining objectClass definitions with attribute definitions in your search filter, you may consider using the Reader class, which will combine those internally.
Furthermore, I would like to point out that I've created an object relational mapper where you can simply define your queries using declarative Python syntax, e.g.:
from ldap3_orm import ObjectDef, Reader
from ldap3_orm.config import config
from ldap3_orm.connection import conn
PersonDef = ObjectDef("person", conn)
r = Reader(conn, PersonDef, config.base_dn, PersonDef.memberof == group.distinguishedName)
r.search()
ldap3-orm documentation can be found at http://code.bsm-felder.de/doc/ldap3-orm

SQL-Injection Troubles

We are working on a lab in class and I cannot seem to find what I am missing. The following code is an SQL Query for authenticating users:
$sel1 = mysql_query ("SELECT ID, name, locale, lastlogin, gender,
FROM USERS_TABLE
WHERE (name = '$user' OR email = '$user') AND pass = '$pass'");
$chk = mysql_fetch_array($sel1);
if (found one record)
then {allow the user to login}
We are supposed to locate the SQL injection vulnerability, which I believe lies in:
WHERE (name = '$user' OR email = '$user') AND pass = '$pass'");
To exploit it, we are basically supposed to log in to an admin profile on a website with a pretty generic username and password form. The given information is that we know the profile name is admin, and we are supposed to exploit only the username entry on the website.
After reading the article Security Idiots and a section of the book Penetration Testing: A Hands-On Introduction to Hacking by Georgia Weidman, these are some of the things I tried:
admin' OR 1--
admin'--
admin' AND 1=1--
And many more variations of these. My understanding is that I am selecting the admin profile, completing that section with the ', then forcing true and killing the rest of the code on that line. However, nothing I try seems to work.
It is also important to note that for this lab we have specially configured virtual machines that allow this attack to work.
So am I on the right track, or am I not understanding the logic behind a SQL injection attack? I am not necessarily looking for the exact code; I am just worried that I am heading off in the wrong direction and missing something.
Any help is much appreciated. And I would be happy to elaborate on anything.
Thank you.
Thanks to Azi for pointing out the issue with space.
The SQL you were supposed to crack was
SELECT ID, name, locale, lastlogin, gender
FROM USERS_TABLE
WHERE (name = '$user' OR email = '$user') AND pass = '$pass'
None of admin' OR 1--, admin'-- or admin' AND 1=1-- would work, because when the user variable is substituted with these inputs the statements become
SELECT ID, name, locale, lastlogin, gender
FROM USERS_TABLE
WHERE (name = 'admin' OR 1-- ' OR email = '$user') AND pass = '$pass'
SELECT ID, name, locale, lastlogin, gender
FROM USERS_TABLE
WHERE (name = 'admin'-- ' OR email = '$user') AND pass = '$pass'
SELECT ID, name, locale, lastlogin, gender
FROM USERS_TABLE
WHERE (name = 'admin' AND 1=1-- ' OR email = '$user') AND pass = '$pass'
All of these are syntax errors, since the opening bracket ( is never closed.
When you give admin')--  (the trailing space at the end of the input is important), the statement becomes
SELECT ID, name, locale, lastlogin, gender
FROM USERS_TABLE
WHERE (name = 'admin')-- ' OR email = '$user') AND pass = '$pass'
The SQL has no syntax error and will select the record with name as admin.

Break down keyword in URL, then check whether the keywords exist in content page

1) Can MATLAB break down the keywords in a URL?
e.g. http://en.wikipedia.org/wiki/Hostname
output: wikipedia wiki Hostname
2) After extracting the keywords from the URL, check whether each keyword exists in the content of the page (like the content below); if yes, return 1, else return 0.
Contents:
Hostname From Wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname (archaically nodename[1]) is a label that is assigned to a device connected to a computer network and that is used to identify the device in various forms of electronic communication such as the World Wide Web, e-mail or Usenet. Hostnames may be simple names consisting of a single word or phrase, or they may be structured. On the Internet, hostnames may have appended the name of a Domain Name System (DNS) domain, separated from the host-specific label by a period ("dot"). In the latter form, a hostname is also called a domain name.
Example of output:
wikipedia [1]
wiki [0]
Hostname [1]
Here is a possible solution:
str = 'http://en.wikipedia.org/wiki/Hostname'
Paragraph = 'Hostname From Wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname (archaically nodename[1]) blah blah'
SplitStrings = regexp(str,'[/.]','split')
c = containers.Map;
for it = SplitStrings
c( it{1} ) = strfind(Paragraph, it{1} )
end
Issues:
You will need to find out a way of including relevant and irrelevant parts of the URL. Currently, it takes http and en as valid parts of string.
You will need to see if you want the case to be respected or not.
It is algorithmically inefficient since it is making as many passes through the data as keywords. I will think about improving on this.
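Building on those issues, here is a rough refinement sketch (the stop list of URL parts is just an illustrative assumption, and substring hits such as 'wiki' inside 'Wikipedia' would still count as matches):
str = 'http://en.wikipedia.org/wiki/Hostname';
Paragraph = 'Hostname From Wikipedia, the free encyclopedia Jump to: navigation, search In computer networking, a hostname ...';
SplitStrings = regexp(str, '[/.]', 'split');
SplitStrings = SplitStrings(~cellfun(@isempty, SplitStrings));                % drop the '' produced by '//'
SplitStrings = SplitStrings(~ismember(SplitStrings, {'http:', 'en', 'org'})); % crude stop list of URL parts
c = containers.Map;
for it = SplitStrings
    % store 1/0 presence flags, case-insensitively, instead of match positions
    c(it{1}) = double(~isempty(strfind(lower(Paragraph), lower(it{1}))));
end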

Comparing two email address lists anonymously

Given two lists:
Company A:
user1@example.com
user2@example.com
user3@example.com
user4@example.com
Company B:
user2@example.com
user4@example.com
user5@example.com
Is there a way to anonymously compare them to get the number of email addresses in common (i.e., 2) without either company knowing which addresses were the ones in common?
Background:
Let's say that company A and company B want to know what portion of their userbase is common. For simplicity, they are just going to base it on email address and not concern themselves with people who use multiple addresses or different address variations (user+misc@example.com).
For the sake of privacy, neither company can give the other the plain list of email addresses. If they used the same simple hash, e.g. MD5, each company could easily know which members were in common (not desired). If they used a hash salted with a company specific secret, the addresses wouldn't be comparable any longer so the question couldn't be answered.
Is there some trick using key encryption or some other mathematical way to accomplish what I'm looking to do?
I believe this question could be understood better in the realm of cryptography.
It is a problem of secure multi-party computation.
I'm not aware of any bullet proof solution for this problem but I can think of the following:
Choose a commutative hash function (H):
H(H(string, seed1), seed2) = H(H(string, seed2), seed1)
Each party (Company A and Company B) has to choose a secret seed:
SEED_A, SEED_B
Company A hashes all email addresses using SEED_A, Company B hashes all email addresses using SEED_B.
They interchange the hashes.
Each company applies the hash function again on the set received from the opposing party.
At this point the data should already be garbled and the companies should not be able to recognize their own email addresses (since they've been already hashed twice - the second time with an unknown key).
All the email addresses should be laid out openly and those that have the same hash should be counted as the email addresses that belong to both companies (except that neither company can tell the source of the hash).
This is the theory. Hopefully I didn't miss anything and there are no flaws in the algorithm.
As for the implementation, here's the most trivial PHP script that I could come with:
$a = array("user1@example.com", "user2@example.com", "user3@example.com", "user4@example.com");
$b = array("user2@example.com", "user4@example.com", "user5@example.com");
function enc($str, $seed) {
for ($i = strlen($str) - 1; $i >= 0; $i--) {
$str[$i] = $str[$i] ^ $seed[$i % strlen($seed)];
}
return $str;
}
/* Company A */
$hashesForB = array();
$SEED_A = 'SALT FOR COMPANY A';
foreach ($a as $address) {
$hashesForB[] = enc($address, $SEED_A);
}
/* Company B */
$hashesForA = array();
$SEED_B = 'THIS IS THE SALT FOR COMPANY B';
foreach ($b as $address) {
$hashesForA[] = enc($address, $SEED_B);
}
/* Company A */
$hashesForB_2 = array();
foreach ($hashesForA as $hash) {
$hashesForB_2[] = enc($hash, $SEED_A);
}
/* Company B */
$hashesForA_2 = array();
foreach ($hashesForB as $hash) {
$hashesForA_2[] = enc($hash, $SEED_B);
}
$common = count(array_intersect($hashesForA_2, $hashesForB_2));
print $common; // it will output 2
As you can see in the code above, I used the XOR operation for (pseudo) hashing (actually, any construction where the two passes commute should do the job).
Obviously, this is not the best choice for many reasons:
XOR will return the original input upon a new call with the same salt
the entropy is not the best you could hope for
the data is not truncated
Still, you could implement your own commutative hashing function; suggestions for this exist in various places, and a rough sketch of one option follows.
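As a minimal sketch (and not production cryptography), one way to get a genuinely commutative construction is modular exponentiation over an agreed large prime, in the spirit of commutative-encryption-based private set intersection. The prime, the secret exponents, and the function names below are all illustrative assumptions; the code requires the GMP extension:
// Both companies agree on a large prime $prime; each keeps its exponent secret.
// Because (h^a)^b == (h^b)^a (mod p), double-processed values can be compared.

// First pass: map your own email address into a number and raise it to your secret.
function first_pass(string $email, string $secret, string $prime): string {
    $h = gmp_init(hash('sha256', strtolower($email)), 16);
    return gmp_strval(gmp_powm($h, $secret, $prime));
}

// Second pass: raise the value received from the other company to your own secret.
function second_pass(string $received, string $secret, string $prime): string {
    return gmp_strval(gmp_powm(gmp_init($received, 10), $secret, $prime));
}
Each company runs first_pass over its own de-duplicated list, the two exchange the results, each runs second_pass over what it received, and the intersection is counted exactly as in the XOR example above.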
Is the privacy concern that a privacy agreement prohibits sharing of email addresses, or is it a competitive concern?
If you just want to get an idea of the percentage of overlap, then I'd think a simple encoding of the email addresses might work. For example, de-dupe each list, Base64-encode each email address, then run the comparison to get the overlap, then report the numbers.
A simple NDA could make this a less technical problem.
It depends on the language you want to use.
In Python, you could use this script:
listA = ('user1@example.com', 'user2@example.com', 'user3@example.com')
listB = ('user1@example.com', 'user2@example.com')
result = [x for x in listA if x in listB]
print(len(result))
For security, you could host this script on an external server where both companies can just put in their lists and then check the result.

Feasibility of extracting arbitrary locations from a given string? [closed]

I have many spreadsheets with travel information on them amongst other things.
I need to extract start and end locations where the row describes travel, and one or two more things from the row, but what those extra fields are shouldn't be important.
There is no known list of all locations and no fixed pattern of text, all that I can look for is location names.
The field I'm searching in has 0-2 locations, sometimes locations have aliases.
The Problem
If we have this:
00229 | 445 | RTF | Jan | trn_rtn_co | Chicago to Base1
00228 | 445 | RTF | Jan | train | Metroline to home coming from Base1
00228 | 445 | RTF | Jan | train_s | Standard train journey to Friends
I, for instance (though it will vary), will want this:
RTF|Jan|Chicago |Base1
RTF|Jan|Home |Base1
RTF|Jan|NULL |Friends
And then to go through, look up what Base1 and Friends mean for that person (whose unique ID is RTF), and replace them with sensible locations (assuming they only have one set of 'friends'):
RTF|Jan|Chicago |Rockford
RTF|Jan|Home |Rockford
RTF|Jan|NULL |Milwaukee
What I need
I need a way to pick out key words from the final column, such as: Metroline to home coming from Base1.
There are three types of words I'm looking for:
Home Locations: these are known and limited; I can get these from a list.
Home Aliases: these are known and limited; I can get these from a list.
Away Locations: these are unknown, but are cities/towns/etc. in the UK. I don't know how to recognize these in the string; this is my main problem.
My Ideas
The go-to program I thought of was awk, but I don't know if I can reliably search for where a proper noun (i.e. a location) is used for the location names.
Is there a package, library or dictionary of standard locations?
Can I get a program to scour the spreadsheets and 'learn' the names of locations?
This seems like a problem that would have been solved already (i.e. find words in a string of text), but I'm not certain what I'm doing, and I'm only a novice programmer.
Any help on what I can do would be appreciated.
Edit:
Any answer such as "US_Locations_Cities is something you could check against", "Check for strings mentioned in a file in awk using ...", "There is a library for language X that will let a program learn to recognise location names, it's not RegEx, but it might work", or "There is a dictionary of location names here" would be fine.
Ultimately anything that helps me do what I want to do (i.e get the location names!) would be excellent.
Sorry to tell you, but I think this is not 100% programmable.
The best bet would be to define some standard searches:
Chicago to Base1
[WORD] to [WORD]:
where "to" is fixed and you look for exactly one word before and after. the word before then is your source and word after your target
Metroline to home coming from Base1
[WORD] to [WORD] coming from [WORD]:
where "to" and "coming from" is fixed and you look for three words in the appropriate slots.
etc
If you can match a source and a target -> OK.
If you cannot match something, then throw an error for that line and let the user decide, or even better implement an appropriate correction and let the program automatically re-evaluate that line.
These are non-trivial goals.
Consider:
Cities outside of the USA
Non-English text entries
Abbreviations
For automatic error correction, try to match the found [WORD]s against a list of US or other cities.
If a city is not found, throw an error. If you see that error, either add the missing city to your city list or translate the city name to a publicly known (official) name.
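To make the fixed-phrase idea concrete, here is a rough Perl sketch (the pipe-separated field layout and column positions are assumed from the sample rows in the question, and the output order for the "coming from" case simply mirrors the desired output shown there):
use strict;
use warnings;
use feature 'say';

while (my $line = <DATA>) {
    chomp $line;
    my @cols = split /\s*\|\s*/, $line;
    my $desc = $cols[-1];
    my ($start, $end) = ('NULL', 'NULL');
    if ($desc =~ /\S+\s+to\s+(\S+)\s+coming\s+from\s+(\S+)/i) {
        ($start, $end) = ($1, $2);   # "X to Y coming from Z" -> Y|Z, per the sample output
    }
    elsif ($desc =~ /(\S+)\s+to\s+(\S+)/i) {
        ($start, $end) = ($1, $2);   # "X to Y" -> X|Y
    }
    # Words like 'journey' still slip through here and would need filtering against a city list.
    say join '|', @cols[2, 3], $start, $end;
}

__DATA__
00229 | 445 | RTF | Jan | trn_rtn_co | Chicago to Base1
00228 | 445 | RTF | Jan | train | Metroline to home coming from Base1
00228 | 445 | RTF | Jan | train_s | Standard train journey to Friends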
The best I can suggest is that, as long as your locations are all US cities, you can use a database of ZIP codes (several are freely available).
I don't know how you expect any program to pick up things like Friends or Base1
I have to agree with hacktick that as it stands now, it is not programmable. It seems that the only solution is to invent a language or protocol.
I think an easy implementation follows:
In this language you have two keywords: to and from (you could also possibly allow at as a keyword synonym for from as well).
These keywords define the portion of the string that follows as a "scan area" for recognizing names.
I'm only planning on implementing the simplest scan, but as indicated at the end of the post this allows you to do your fallback.
In the implementation you have a "Preferred Name" hash, where you define the names that you want displayed for things that appear there.
my %preferred_title
    = ( Base1   => 'Rockford'
      , Friends => 'Milwaukee'
      # , ...
      );
You could split your sentences into chunks of text between the keywords, using the following rules:
1. The first chunk, if it is not a keyword, is taken as the value of 'from'.
2. On this or any subsequent chunk, if it is a keyword, then save the chunk after it as that keyword's value.
3. Each value is "scanned" for a preferred phrase before being stored as the value.
# Split the note on the keywords "from"/"to" (kept because of the capture),
# trim whitespace from each piece, and drop empties.
my @chunks
    = grep {; defined and ( s/^\s+//, s/\s+$//, length ) }
      split /\b(from|to)\s+/i, $note
    ;
my %parts = ( to => '', from => '' );
my $key;
while ( my $chunk = shift @chunks ) {
    if ( $key ) {
        # The previous chunk was a keyword; this chunk is its value.
        $parts{ $key } = $preferred_title{ $chunk } // $chunk;
        $key = '';
    }
    elsif ( exists $parts{ lc $chunk } ) {
        # The chunk is itself one of the keywords ("from" or "to").
        $key = lc $chunk;
    }
    elsif ( !$parts{from} ) {
        # A leading chunk before any keyword is treated as the 'from' value.
        $parts{from} = $preferred_title{ $chunk } // $chunk;
    }
}
say join( '|', $note, @parts{ qw<from to> } );
At the very least, collecting these values and printing them out can give you a sieve to decide on further courses of action. This will tell you that 'home coming' is perceived as a 'from' statement, as well as 'Standard train journey'.
You could fix the 'home coming' issue by amending the regex thus:
/\b(?:(?:coming )?(from)|(to))\s+/i
And we could add the following key-value pair to our preferred_title hash:
home => 'Home'
We could simply define 'Standard train journey' => '', or we could create a list of rejection patterns, where we reject a string as a meaningful value if they fit a pattern.
But these steps allow you to dump out a list of values and refine your scan of the data. Another idea: you seem to be pretty consistent with your use of capitals for places (except for 'home'), so we could increase our odds of finding the right string by matching the chunk with
/\b(home|\p{Upper}.*)/
Note that this still considers 'Standard train journey' a proper location, so it would still need to be handled by rejection rules.
Here I reiterate that this is a minimal approach: scan the data until you can make sense of what this system takes to be locations and "80/20" it down. That is, hopefully those rules handle 80 percent of the cases, you can tune the algorithm to handle 80 percent of the remaining 20, and you iterate to the point where you only have to fix a handful of entries by hand at worst.
Then you would have a specification to follow when creating travel notes from then on. You could even scan the notes as they are entered and alert with something like
'No destination found in note!'.