How to do a name similarity using clustering - cluster-analysis

I have a very big (super big) database of names.
The task is to find all the similar names (belonging to the same person) despite differences such as:
first name and last name swapped --> John Doe & Doe John
two or more names (the same ones) with slight changes, maybe some letters misplaced or something else --> Jonh Doe & John Deo
names with some letters added --> Johhn Doe & Johnn Doee & John Doe
names where another middle name is inserted --> John Blair Campbell Doe & John Blair Doe
And so on.
I tried classical methods like Soundex and Levenshtein distance, but the results were not very good: for example, Amine depi and Amina dope end up in the same group although they are different.
It would also take very long to perform the task on just a fraction of the data; on my full database, it would simply crash after a long time.
I also thought of using another approach such as cosine similarity, which works on numerical values, so I looked for a way of representing the names numerically or converting them (something like word2vec). I actually tried using word2vec directly with the whole database of names as the text, but as expected it didn't work. I also tried encoding the names in a low-level way, such as ASCII codes, but the results weren't good either.
So I thought of Clustering.
So I tried DBSCAN. I found a way to use DBSCAN clustering with a custom distance metric and used the Levenshtein distance. (Why DBSCAN? Because I don't know in advance how many groups of similar names are in the database.)
I did get some results, but they were poor overall. It would either put only exact duplicates in the same cluster (John Doe and John Doe) or nothing at all, and it would even miss some exact duplicates.
Do you have a suggestion for performing this task, preferably using clustering or another smart approach? The database is very big (more than 500,000 lines and up to millions), so I cannot iterate a lot.
I am open to suggestions or propositions!
Especially if you have worked on something like this or similar before. Thank you in advance.

Try AgglomerativeClustering from scikit-learn.
Sample code:
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering(
    n_clusters=None,  # must be None when distance_threshold is set
    distance_threshold=0.3,  # a smaller threshold means stricter similarity and more clusters
).fit(your_vectorized_name_list)
print(f'total clusters: {clustering.n_clusters_}')
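The sample above leaves `your_vectorized_name_list` undefined. One possible way to turn names into vectors, offered only as a minimal sketch and not as the answerer's method, is to count character bigrams after sorting the name tokens. The helpers `name_vector`, `cosine_similarity` and `similarity` are hypothetical names introduced for illustration; they use only the standard library.

```python
import math
from collections import Counter


def name_vector(name, n=2):
    """Character n-gram counts of a normalized name (hypothetical helper)."""
    # Lowercase and sort the tokens so "Doe John" and "John Doe" normalize identically.
    tokens = sorted(name.lower().split())
    text = " " + " ".join(tokens) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(x, y):
    """Similarity of two raw name strings (assumption: bigrams are enough)."""
    return cosine_similarity(name_vector(x), name_vector(y))


swapped = similarity("John Doe", "Doe John")      # same tokens, different order
typo = similarity("John Doe", "Johhn Doe")        # one extra letter
different = similarity("John Doe", "Amina Dope")  # a different person
```

With this representation, swapped names compare as identical and a single-letter typo stays far closer than an unrelated name. Dense vectors built this way (or with scikit-learn's `TfidfVectorizer(analyzer="char")`) could then be passed to the clustering call; the right n-gram size and threshold would need tuning on real data.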

Related

I need help in data sanitization problem in tableau

I tried doing the sanitization manually, but I get a type mismatch error when performing the calculations.
I also need help in sanitizing the data and getting the insight as per the below instructions:
sellerproductcount - this column gives you the count of products in the form '1-16 of over 100,000 results', and you can parse out the product count 100,000.
sellerratings - this column gives you the % and count of positive ratings (e.g. 88% positive in the last 12 months (118 ratings)) if parsed correctly.
sellerdetails - you can use this text to parse out phone numbers and email IDs of merchants, where available, so our team can reach out to them.
businessaddress - this will give you the business locations of the sellers. You can parse them to identify if a seller is registered in the US, Germany (DE), or China (CN).
Hero Product 1 #ratings and Hero Product 2 #ratings - these 2 columns give you the number of ratings of the 2 'hero products' or bestselling products of this seller.
I have attached the dataset for the same.
https://docs.google.com/spreadsheets/d/1PSqRCnmFgq7v7RzZaCXXoV0Edp_vM7QO/edit?usp=sharing&ouid=115547990006782902200&rtpof=true&sd=true
Most of this type of data prep can be done with string and regex functions like REGEXP_MATCH(). Here are a few examples based on the data you shared:
Seller Product Count
INT(REGEXP_EXTRACT([Sellerproductcount], '(\d*,?\d*) results'))
1-16 of over 6,000 results >> 6000
Seller Rating (Percentage)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*)% positive'))
92% positive in the last 12 months (181 ratings) >> 92
Seller Rating (Count)
INT(REGEXP_EXTRACT([Sellerratings], '(\d*) (?:total )?ratings'))
92% positive in the last 12 months (181 ratings) >> 181
Business Country Code
RIGHT([Businessaddress],2)
AM Treptower Park28-30Berlin12435DE >> DE
These examples all have very straightforward patterns that are present in all rows so they can be done pretty easily with one simple calculation. However, something like sellerdetails which is unstructured, inconsistent, and sometimes incomplete will be a bit more of a challenge. You will need to use a couple of different calculations and techniques combined together to find what you are looking for, as well as some manual data prep. Here's an example of how you can pull out email but it won't work for everything:
Email
REGEXP_EXTRACT([Sellerdetails], '([a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)')
Good luck with your data cleaning. I suggest using sites like https://regex101.com/ and https://regexr.com/ to learn more about regular expressions and to test them.
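To try the patterns outside Tableau, here is a rough Python equivalent of the first four calculations, using the same regular expressions with the standard `re` module. It is only a sketch: `to_int` and `parse_seller_fields` are hypothetical helper names, and Python's regex engine is not identical to Tableau's, so results should be confirmed in Tableau itself.

```python
import re


def to_int(match):
    """Convert capture group 1 to an int, ignoring thousands separators."""
    if match is None:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits else None


def parse_seller_fields(product_count, ratings, address):
    """Hypothetical helper mirroring the Tableau calculations above."""
    return {
        "product_count": to_int(re.search(r"(\d*,?\d*) results", product_count)),
        "rating_percent": to_int(re.search(r"(\d*)% positive", ratings)),
        "rating_count": to_int(re.search(r"(\d*) (?:total )?ratings", ratings)),
        # Assumes the address always ends with a two-letter country code.
        "country": address[-2:],
    }


parsed = parse_seller_fields(
    "1-16 of over 6,000 results",
    "92% positive in the last 12 months (181 ratings)",
    "AM Treptower Park28-30Berlin12435DE",
)
```

Note that the comma is removed explicitly before the integer conversion here; whether a given tool converts a string like "6,000" on its own is worth checking on a few rows.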

Is this table in first normal form?

I am currently studying SQL normal forms.
Let's say I have the following table, where the primary key is userid:
userid  FirstName  LastName  Phone
1       John       Smith     555-555
1       Tim        Jack      432-213
2       Sarah      Mit       454-541
3       Tom        jones     987-125
The book I'm reading states the following conditions must be true in order for a table to be in 1st normal form.
1. Rows contain data about an entity.
2. Columns contain data about attributes of the entities.
3. All entries in a column are of the same kind.
4. Each column has a unique name.
5. Cells of the table hold a single value.
6. The order of the columns is unimportant.
7. The order of the rows is unimportant.
8. No two rows may be identical.
9. A primary key must be assigned.
I'm not sure whether my table violates the 8th rule, "No two rows may be identical", because the first two records in my table
1 John Smith 555-555
1 Tim Jack 432-213
share the same userid. Does that mean they are considered duplicate rows?
Or does "duplicate records" mean that every piece of data in the row has to be the same for the record to be considered a duplicate row, as in the example below?
1 John Smith 555-555
1 John Smith 555-555
EDIT1: Sorry for the confusion
The question I was trying to ask is simple
Is this table below in 1st normal form?
userid  FirstName  LastName  Phone
1       John       Smith     555-555
1       Tim        Jack      432-213
2       Sarah      Mit       454-541
3       Tom        jones     987-125
Based on the 9 rules given in the textbook I think it is, but I wasn't sure whether rule 8, "No two rows may be identical", was being violated because two records use the same primary key.
The class textbook and professor aren't very clear on this subject, which is why I am asking this question.
Or does duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate row see example below?
They mean the latter of your two choices. Entire rows are what must be "identical". It's OK if two rows share the same values for one or more columns as long as one or more columns differ.
That's because a relation holds a set of values that are tuples/rows/records, and a set is a collection of values that are all different.
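As a small illustration of the set-of-tuples view, here is a Python sketch using the rows from the question. The variable names are hypothetical. It checks two separate things: whether any entire row is repeated (rule 8), and whether userid alone is unique, which is what declaring it the primary key would additionally require.

```python
# Rows from the question, written as tuples (userid, FirstName, LastName, Phone).
rows = [
    (1, "John", "Smith", "555-555"),
    (1, "Tim", "Jack", "432-213"),
    (2, "Sarah", "Mit", "454-541"),
    (3, "Tom", "jones", "987-125"),
]

# A set keeps only distinct tuples, so a shorter set means some whole row repeats.
has_identical_rows = len(set(rows)) != len(rows)

# A key must identify a row on its own, so its values must be unique.
user_ids = [row[0] for row in rows]
userid_is_unique = len(set(user_ids)) == len(user_ids)
```

Here no two whole rows are identical, so rule 8 holds, but userid 1 appears twice, so userid by itself cannot serve as a key for this data.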
But SQL & some relational algebras have different notions of "identical" in the case of NULLs compared to the relational model without NULLs. You should read what your textbook says about it if you want to know exactly what they mean by it. Two rows that have NULL in the same column are considered different. (Point 9 might be summarizing something involving NULLs. Depends on the explanation in the book.)
PS
There's no single notion of what a relation is. There is no single notion of "identical". There is no single notion of 1NF.
Points 3-8 are better described as (poor) ways of restricting how to interpret a picture of a table to get a relation. Your textbook seems to be (strangely) making "1NF" a property of such an interpretation of a picture of a table. Normally we simply define a relation to be a certain thing so if you have one then it has to have the defined properties. Then "in 1NF" applies to a relation & either means "is a relation" & isn't further used or it means certain further restrictions hold. A relation is a set of tuples/rows/records, and in the kind of relation your 3-8 describes they are sets of attribute/column/field name-value pairs & the values paired with a name have to be of the type paired with that name in some schema/heading that is a set of name-type pairs that is defined either as part of the relation or external to it.
Your textbook doesn't seem to present things clearly. Its definition of "1NF" is also idiosyncratic in that although 3-8 are mathematical, 1 & 2 are informal/heuristic (& 9 could be either or both).

Word Master-Detail Mail Merge

I searched a lot to find out how to do a "master-detail" style mail merge in Word, but I couldn't find an appropriate answer or tutorial.
My data looks like this:
id name property
-----------------------------
1 John Doe employed
1 John Doe married
1 John Doe male
2 Don Joe employed
2 Don Joe single
2 Don Joe male
and the result should look like this:
with a page-break between every key record.
Does anyone know how this can be achieved through a Microsoft Word Mail Merge?
(I know SSRS etc. are better tools for this, but I have to use MS Word because it is the only thing available on the client.)
Yes, this is possible using Word's mail merge, but I don't think you can do it with your data the way it is now. There's not much that can be done to manipulate the data once it gets to the mail merge.
You would need to clean up the data to combine the properties for each person into one field before you can do the merge.
Is your data in Excel? You can write some VBA to combine the properties field.
If it's in a database, you can use SQL to combine the fields using FOR XML PATH.
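The answer suggests VBA or SQL for this clean-up step. Purely to illustrate the transformation itself, here is a sketch in Python (a swapped-in language, not what the answer proposes) that collapses the detail rows into one record per person. The function name `combine_properties` and the output field names are assumptions.

```python
# Detail rows from the question: (id, name, property).
rows = [
    (1, "John Doe", "employed"),
    (1, "John Doe", "married"),
    (1, "John Doe", "male"),
    (2, "Don Joe", "employed"),
    (2, "Don Joe", "single"),
    (2, "Don Joe", "male"),
]


def combine_properties(rows, separator=", "):
    """Return one record per (id, name) with all properties joined into one field."""
    grouped = {}
    for person_id, name, prop in rows:
        # dicts preserve insertion order, so people stay in first-seen order.
        grouped.setdefault((person_id, name), []).append(prop)
    return [
        {"id": person_id, "name": name, "properties": separator.join(props)}
        for (person_id, name), props in grouped.items()
    ]


merged = combine_properties(rows)
```

Each resulting record then maps to one merge document, with `properties` as a single merge field.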
Once your data is cleaned up, you would be able to use the Mail Merge Wizard.
For the Doc Type, select Letter and use a blank page (current document).
Then select your data.
The next step is to Write Your Letter. Type your field label (e.g. Name:), then click Insert Merge Field and select the field to insert.
Repeat for all your fields.

Counting multiple values from one column in Tableau

I have a field from the data I am reading in that can contain multiple values. They are essentially tags.
For example, there could be a column called "persons responsible". This could read "Joe; Bob; Sue" or "Sue" for a given row.
Is it possible from within Tableau to read these in as separate categories? So that for this sample data:
Project | Persons
---------------------------
Zeta | Bob; Sue; Joe
Enne | Sue
Doble Ve | Bob
There could be a count of Bob (2), Sue (2), Joe (1)?
I am working on getting better data inputs, but I was wondering if there was a temporary solution at this level.
I would definitely work towards normalizing your schema.
In the meantime, there is a workaround that is almost reasonable if there is a small set of possible values for the tags (persons in your example).
If Bob, Sue and Joe are the only people in the system, you can use the contains() function to define a boolean calculated field for each person -- e.g. Bob_Is_Responsible = contains([Persons], 'Bob'), and similar fields for Sue and Joe. Then you could use those as building blocks, possibly with sets, to break the data up in different ways.
Of course, this approach gets cumbersome fast if the number of tags grows, or if it is unconstrained. But you asked for a temporary solution ...
If the number of elements is small, you can write and union several queries, with each one returning the project and the nth element.
Ideally, you'd reshape your data to look like this either in the database or with the above mentioned union technique. Then you could count() or countd() the elements by project.
Project | Persons
---------------------------
Zeta | Bob
Zeta | Sue
Zeta | Joe
Enne | Sue
Doble Ve | Bob
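If the reshaping has to happen outside Tableau, the split-and-count step can be sketched in a few lines of Python. This is only an illustration on the sample data from the question; `count_people` is a hypothetical helper, and the "; " delimiter is taken from the example.

```python
from collections import Counter

# Sample data from the question: (Project, Persons).
rows = [
    ("Zeta", "Bob; Sue; Joe"),
    ("Enne", "Sue"),
    ("Doble Ve", "Bob"),
]


def count_people(rows, delimiter=";"):
    """Count how many rows each person appears in."""
    counts = Counter()
    for _project, persons in rows:
        # Strip whitespace around each name and skip empty fragments.
        counts.update(p.strip() for p in persons.split(delimiter) if p.strip())
    return counts


person_counts = count_people(rows)
```

On the sample this gives Bob 2, Sue 2, Joe 1, matching the counts asked for in the question.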

Working with a delimited list of items in a Tableau field

I am preparing a data visualization in Tableau.
I have some data that can be simplified like this:
Name, Score, Tag
Joe, 5, A;B
Phil, 7, D
Quinn, 9, A;C
Bill, 3, A;B;C
I would like to generate a word cloud on the Tag field that counts
occurrences of each item A,B,C. So I need to generate this:
A,3
B,2
C,2
D,1
In other words, I need help working with a field that contains a list of delimited values.
In the example data ; is the delimiter, but it could be anything.
I would like the word cloud to update as the user
applies filters, e.g. dragging a slider to set score > 5.
So the tag count has to be done on the fly.
I'm pretty sure I'll need to use field calculations and table calculations..?
Possibly I'll need to have a separate table tracking the tags..?
I have no problem building the word cloud and other viz elements.
What I'm looking for help with is parsing the delimited list field and
calculating the tag counts.
I do have full control over the source data, so if there is an easier way to
do this by reorganizing the schema, I'd be glad to do that. I thought of breaking
the field up into separate tag1, tag2, tagX fields and trying to count over the
separate fields... but I'm not sure if this is any simpler.
Thanks for any tips.
Another (probably better in your case) approach is to reshape the data before feeding it to Tableau. Tableau works best with normalized data.
Preprocess it to look like:
Name, Score, Tag
Joe, 5, A
Joe, 5, B
Phil, 7, D
Quinn, 9, A
Quinn, 9, C
Bill, 3, A
Bill, 3, B
Bill, 3, C
At that point, the standard Tableau word cloud charts should work well, and it will scale easily as you add more tags and data.
Reshaping data to normalize it prior to analysis with Tableau is a pretty standard step. Sometimes you can do it automatically, say with custom SQL, but often you'll have to use some sort of script first. If your data comes from Excel, Tableau has a plug-in that can help with reshaping data. Look for it on the Tableau knowledge base.
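As an example of "some sort of script", here is a minimal Python sketch that rewrites the sample CSV with one row per tag, using only the standard `csv` module. The function name `explode_tags` and the in-memory strings are assumptions for illustration; a real script would read and write files.

```python
import csv
import io


def explode_tags(csv_text, delimiter=";"):
    """Return CSV text with one output row per (Name, Score, Tag) combination."""
    # skipinitialspace handles the "Name, Score, Tag" style with spaces after commas.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Name", "Score", "Tag"])
    for row in reader:
        for tag in row["Tag"].split(delimiter):
            if tag.strip():
                writer.writerow([row["Name"], row["Score"], tag.strip()])
    return out.getvalue()


source = """Name, Score, Tag
Joe, 5, A;B
Phil, 7, D
Quinn, 9, A;C
Bill, 3, A;B;C
"""

reshaped = explode_tags(source)
```

Because Score is repeated on every exploded row, a filter such as score > 5 still applies per tag row, so the tag counts can update with the filter.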
Here's an approach that would be tolerable if you had a fixed set of 3 or 4 tags. Since you have closer to 50K possible tags, it's not a feasible approach for your problem as is. But maybe it will give you an idea. Similar approaches can be used to solve different kinds of problems in Tableau, so it's a useful trick to know.
For each tag, create a boolean calculated field that returns 1 if the current row contains that particular tag and null otherwise (or whatever the condition is you want to detail)
For example, define a calculated field called Tag_A defined as:
if contains(Tag, "A") then
1
end
Similarly, define calculated fields Tag_B, Tag_C, etc.
So far it's easy.
Then you can use those fields in other calculations to count the number of records that contain tag A, filter to only those that contain A, use the calculated field on the condition tab when defining sets that are computed dynamically by a formula ... Of course, the low level calculated field function can be more complex, say checking for the presence of at least 2 fields out of a list for example.
If nothing else, this approach sometimes lets you break complex problems into bite sized pieces.
Unfortunately, hard coding calculated field names won't scale to 50K tags. For that, you probably want to reshape your data.