Classification of gender for given names - classification

after some research I could not find yet a suitable open source library or software I can use to classify by most likely gender a long table of first names I have.
For my application I have a set of first names from many different countries, and many of them are also pretty exotic.
For example, when I tried to use Genderize I could get only 1/8 of the names classified, while the remaining are labeled as Unknown (I made sure that the format is correct, no lower/upper case ambiguity, etc..).
Any advise would be appreciated. Thank you in advance !

For the record, the best I could find was really just do it manually looking up names from google or dedicated websites such as https://namepedia.org. I am afraid there is no automated solution for my use case. This mostly for the following reasons:
Many names are somewhat archaic (I could not even recognise several names of my own nationality)
Many names were truncated to form nicknames or had two nearby letters swapped: here a LUT approach would fail and rather one would need a score from a model
There were several names not based on Roman alphabet but where the mapping into roman characters produced some ambiguities I guess
For those curious of the original dataset, this is part of a Kaggle challenge (Spaceship Titanic, https://www.kaggle.com/competitions/spaceship-titanic).

Related

Imbalanced multiclass classification using company names

I have this classification scenario below in which Im getting a very low F1, precision, recall and other metrics.
Target is multiclass (about ~200 classes) which is highly imbalanced
I only use company names as classifier (mostly 1-2 words which have max of 8 words), no other fields (like description, etc.)
Training data ~ 100k+ records
Preprocessing: numeric and special characters and stopwords removal
I have very low resources for processing (thats why when I try to use oversampling techniques like smote, distance_smote for multiclass, etc., I always get memory error)
Tried using different vectorization/embedding/tokenizer like word2vec, tfidf, fasttext, bert, roberta, etc. but to no avail
Tried using (and fine-tuning) different algorithms (networks, svm, trees, boosting, etc.) but also getting low scores.
I also did cost-sensitive learning (using class weights) but it only decreased my scores.
Tried all options that I know but scores are not increasing. Can you recommend other options here or do you think any part of the process that may be wrong/discarded? Thank you!
Distribution of target labels:
Sample observations
There is essentially no way to know that 'Exxon' is an oil company, and 'Apple' a computer company, and 'McDonalds' a fast-food chain, just from their company names.
Even if you have a list of every other company in the world, by name and type, that's not enough to make the deduction for these last 3. Only other outside info – like a few sentences about them, or other data – could classify them.
In fact, while company names sometimes describe their exact field-of-commerce, often they're totally arbitrary, as that gives them more freedom to range over many products/services, or create their own unique associations with the name (aka branding).
So I strongly suspect your (unshown) names & (unshown) labels are just too arbitrary for the data you're using to get very good at the task you're attempting.
Is there a real-world situation where someone will only have a company name – no other info, or research options – and benefit from correctly guessing the class? If so, more specifics about the situation might help generate more specific tactical recommendations. But mainly such recommendations will be: get richer data about the targets of the classification.
You might squeeze a little more out of vague trends in corporate naming via better preprocessing/feature-extraction. You may want to keep numbers, special-characters, & punctuation in some form, as they might include extra slight hints. Using subwords (character n-grams) might also reveal some shared word-roots used even in made-up names.

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here does.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way I have multiple spellings of various permutations of a parent level variable (fruits or vegetables in this example) and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records of multiple spellings and short spellings (e.g. apple, appls) and I want to group them together in some way. In this example by fruits and vegetables.
Clustering would be challenging since each entry is most often 1 or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table created by a human (not likely with millions of rows), is there any approach I can take with this problem?
You will need to first solve the spelling problem, unless you have Google scale data that could allow you to learn fixing spelling with Google scale statistics.
Then you will still have the problem that "Apple" could be a fruit or a computer. Apple and "Granny Smith" will be completely different. You best guess at this second stage is something like word2vec trained on massive data. Then you get high dimensional word vectors, and can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.

Efficiently extract WikiData entities from text

I have a lot of texts (millions), ranging from 100 to 4000 words. The texts are formatted as written work, with punctuation and grammar. Everything is in English.
The problem is simple: How to extract every WikiData entity from a given text?
An entity is defined as every noun, proper or regular. I.e., names of people, organizations, locations and things like chair, potatoes etc.
So far I've tried the following:
Tokenize the text with OpenNLP, and use the pre-trained models to extract people, location, organization and regular nouns.
Apply Porter Stemming where applicable.
Match all extracted nouns with the wmflabs-API to retrieve a potential WikiData ID.
This works, but I feel like I can do better. One obvious improvement would be to cache the relevant pieces of WikiData locally, which I plan on doing. However, before I do that, I want to check if there are other solutions.
Suggestions?
I tagged the question Scala because I'm using Spark for the task.
Some suggestions:
consider Stanford NER in comparison to OpenNLP to see how it compares on your corpus
I wonder at the value of stemming for most entity names
I suspect you might be losing information by dividing the task into discrete stages
although Wikidata is new, the task isn't, so you might look at papers for Freebase|DBpedia|Wikipedia entity recognition|disambiguation
In particular, DBpedia Spotlight is one system designed for exactly this task.
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38389.pdf
http://ceur-ws.org/Vol-1057/Nebhi_LD4IE2013.pdf

Fuzzy string matching: which tool?

I have a large number of strings containing a product name and a few other properties (size, volume, age, etc). But the strings are not standardized at all. Product names might be misspelled, volume might be in a different notation (0.5l, 1/2 liter, 500ml, etc). The number of variations is limited though, there are for instance only a few hundred products. What tools can I use to analyze each string and tell me if it contains certain tokens? My guess is that some sort of learning mechanism would be useful, but I'm not sure which tools would offer just that. I've looked at ElasticSearch, but I'm not sure if that's the way to go. All my data is currently in a PostgreSQL db and I've looked at pg_grm as well. Again, not sure if that fits my need.
One solution I've been thinking about is maintaining a list of proper keywords and, per string, see if the string contains any of the keywords. I'm not sure if this would work and, if it would, how to efficiently and effectively implement it in postgresql
EDIT
Here are a few example lines I'm trying to extract keywords from:
wine Bardolo red 1L 12b 12%
La Tulipe, 13* box 3 bottles, 2005
Great Johnny Walker 7CL 22% red label
Wisky Jonny Walken .7 Red limited editon
I've done quite some searching by now but have yet to find a proper way to solve this problem.
I've used pg_trgm extension for similar task (I was comparing misspelled address lines and company names) along with clustering algorithm (may be not needed in your case).
It's done it's job with some data preparations (regexp replacements).
May be not very easy but I'm sure it's possible to solve your problem too. And index support in pg_trgm is great.

Fastest method of checking if multiple different strings are a substring of a 2nd string

Context:
I'm creating a program which will sort and rename my media files which are named e.g. The.Office.s04e03.DIVX.WaREZKiNG.avi into an organized folder structure, which will consist of a list of folders for each TV Series, each folder will have a list of folders for the seasons, and those folders will contain the media files.
The problem:
I am unsure as to what the best method for reading a file name and determining what part of that name is the TV Show. For e.g. In "The.Office.s04e03.DIVX.WaREZKiNG.avi", The Office is the name of the series. I decided to have a list of all TV Shows and to check if each TV Show is a substring in the file name, but as far as I know this means I have to check every single series against the name for every file.
My question: How should I determine if a string contains one of many other strings?
Thanks
The Aho-Corsasick algorithm[1] efficiently solves the "does this possibly long string exactly contain any of these many short strings" problem.
However, I suspect this isn't really the problem you want to solve. It seems to me that you want something to extract the likely components from a string that is in one of possibly many different formats. I suspect that having a few different regexps for likely providers, video formats, season/episode markers, perhaps a database of show names, etc, is really what you want. Then you can independently run these different 'information extractors' on your filenames to pull out their structure.
[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
It depends on the overall structure of the filenames in general, for instance is the series name always first? If so a tree structure work well. Is there a standard marking between words (period in your example) if so you can split the string on those and create a case-insensitive hashtable of interesting words to boost performance.
However extracting seasons and episodes becomes more difficult, a simple solution would be to implement an algorithm to handle each format you uncover, although by using hints you could create an interesting parser if you wanted too. (Likely overkill however)