String pattern matching using unreliable data - neural-network

I know that there are several threads about string pattern matching, but I feel like my circumstances are slightly different.
I have a list of user-entered claim numbers; each claim number is unique. Each claim number can be in varying formats; it all depends on the user and how much of the actual full claim number they give us. We don't know what each insurance carrier's claim format is; they won't share it with us.
Let's assume that our users have entered the following claim numbers:
Insurance Carrier X claim numbers:
04756G215
04759Q696
04760G279
04760T844
00631F546
006G34549
006J73029
Insurance Carrier Y claim numbers:
000628948-014
01-VK4994-0
01-VW6183-4
01-WC20436
12082356
01VL0063-6
01WB16121
03-016298-2
03-165476-3
1000-66-0792
1000-66-3808
1000-67-8667
1000-68-1360
1000-68-1686
1000-68-8494
1000-69-5647
1000-69-6905
Insurance Carrier Z:
42RBB903752
444F09799
51RBB672507
51RBC153279
55RBB120866
55RBB339718
As you can see, the formats differ. Also, I am certain that I can't rely on users to enter the number correctly; they often omit parts of it because that part may contain some claim office code; we simply don't know.
Knowing this, I want to enter a claim number into a system that tells me which carrier it most likely belongs to.
So 55RBB339719 would most likely give me carrier Z.
Are neural networks the way to go? Fuzzy Logic?
UPDATE:
Here's a string pattern to match:
51RAB435220
As you can see, it's the same pattern (2 digits, 3 letters, 6 digits).
However, a user might enter only RAB435220, because the first two digits may be insignificant: they may be a department code rather than part of the actual claim number. It's possible that only the last 6 digits are the significant ones.
What makes it difficult is that we don't know what the significant digits are.
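For what it's worth, before reaching for neural networks or fuzzy logic, a simpler baseline may get you far: reduce every example to a character-class "shape" (digit becomes 9, letter becomes A, punctuation kept as-is), collect the shapes seen per carrier, and match a new entry against them, accepting suffix matches to tolerate omitted prefixes. A minimal sketch in Python (the carrier names, example lists, and the 1.0/0.5 scoring weights are illustrative assumptions, not a tested design):

import re
from collections import defaultdict

def shape(claim: str) -> str:
    """Map digits to '9' and letters to 'A'; punctuation is kept as-is."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", claim))

def build_shapes(examples: dict) -> dict:
    """Collect the set of shapes observed for each carrier."""
    shapes = defaultdict(set)
    for carrier, claims in examples.items():
        for claim in claims:
            shapes[carrier].add(shape(claim))
    return shapes

def likely_carriers(claim: str, shapes: dict) -> list:
    """Score 1.0 for an exact shape match, 0.5 when the entry's shape
    matches only the tail of a known shape (user omitted a prefix)."""
    s = shape(claim)
    scored = []
    for carrier, known in shapes.items():
        if s in known:
            scored.append((carrier, 1.0))
        elif any(k.endswith(s) for k in known):
            scored.append((carrier, 0.5))
    return sorted(scored, key=lambda t: -t[1])

examples = {
    "X": ["04756G215", "006G34549"],
    "Y": ["01-VK4994-0", "1000-66-0792", "12082356"],
    "Z": ["55RBB339718", "51RBC153279"],
}
shapes = build_shapes(examples)
print(likely_carriers("55RBB339719", shapes))  # [('Z', 1.0)]
print(likely_carriers("RAB435220", shapes))    # [('Z', 0.5)] via suffix match

On this data, 55RBB339719 matches carrier Z exactly by shape, and the truncated RAB435220 from the update still surfaces Z through the suffix rule.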

Related

Imbalanced multiclass classification using company names

I have the classification scenario below, in which I'm getting very low F1, precision, recall, and other metrics.
The target is multiclass (~200 classes) and highly imbalanced.
I only use company names as input (mostly 1-2 words, with a maximum of 8 words), no other fields (like description, etc.).
Training data: ~100k+ records.
Preprocessing: removal of numeric characters, special characters, and stopwords.
I have very limited resources for processing (that's why, when I try oversampling techniques like SMOTE, distance_smote for multiclass, etc., I always get memory errors).
Tried different vectorizations/embeddings/tokenizers like word2vec, tfidf, fasttext, bert, roberta, etc., but to no avail.
Tried (and fine-tuned) different algorithms (networks, SVM, trees, boosting, etc.) but also got low scores.
I also did cost-sensitive learning (using class weights), but it only decreased my scores.
I've tried all the options I know of, but the scores are not improving. Can you recommend other options, or do you see any part of the process that may be wrong or worth discarding? Thank you!
(Charts of the distribution of target labels and sample observations omitted.)
There is essentially no way to know that 'Exxon' is an oil company, and 'Apple' a computer company, and 'McDonalds' a fast-food chain, just from their company names.
Even if you have a list of every other company in the world, by name and type, that's not enough to make the deduction for these last 3. Only other outside info – like a few sentences about them, or other data – could classify them.
In fact, while company names sometimes describe their exact field-of-commerce, often they're totally arbitrary, as that gives them more freedom to range over many products/services, or create their own unique associations with the name (aka branding).
So I strongly suspect your (unshown) names & (unshown) labels are just too arbitrary for the data you're using to get very good at the task you're attempting.
Is there a real-world situation where someone will only have a company name – no other info, or research options – and benefit from correctly guessing the class? If so, more specifics about the situation might help generate more specific tactical recommendations. But mainly such recommendations will be: get richer data about the targets of the classification.
You might squeeze a little more out of vague trends in corporate naming via better preprocessing/feature-extraction. You may want to keep numbers, special-characters, & punctuation in some form, as they might include extra slight hints. Using subwords (character n-grams) might also reveal some shared word-roots used even in made-up names.
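As a concrete illustration of the subword suggestion (a sketch only; the example names and labels are placeholders, and scikit-learn is an assumed choice):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Character n-grams keep digits, punctuation, and shared word-roots
# that whole-word tokenization would discard.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)

names = ["Exxon Mobil Corp", "Apple Inc", "McDonalds Ltd"]  # placeholder data
labels = ["energy", "technology", "restaurants"]            # placeholder labels
model.fit(names, labels)
print(model.predict(["Exxonn Mobile"]))  # misspelling still shares many n-grams

Even so, expect only marginal gains for the reasons above; the signal simply may not be in the names.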

Picking a check digit algorithm

I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
This approach kind of works, but it seems that one digit is not enough to detect single-character errors, and undetected errors are quite easy to find: for example, K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?
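One classical direction (a sketch of a standard weighted checksum, not a recommendation from the question's author): map each character to a small integer and take a position-weighted sum mod 10, with weights coprime to 10. That provably catches every single-character error in the digit positions; with only 10 check values against a 24-letter alphabet, some letter substitutions (those 10 or 20 places apart in the alphabet) are unavoidably missed by any one-decimal-digit scheme.

LETTERS = "ABCDEFGHJKLMNPQRSTUVWXYZ"  # I and O removed, as in the question

def char_value(c: str) -> int:
    return int(c) if c.isdigit() else LETTERS.index(c)

def check_digit(code: str) -> str:
    """Weighted sum mod 10; weights coprime to 10, so any single error
    in a digit position always changes the checksum."""
    weights = [7, 3, 9, 1, 3]  # one weight per character of [A-Z][0-9]{4}
    return str(sum(w * char_value(c) for w, c in zip(weights, code)) % 10)

def is_valid(code: str) -> bool:
    return check_digit(code[:-1]) == code[-1]

otp = "K3770" + check_digit("K3770")
print(otp, is_valid(otp))                  # original code validates
print(is_valid(otp[:2] + "6" + otp[3:]))   # single-digit error -> False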
Frame challenge
Why do you need a check digit?
It doesn't improve security, and five digits is trivial for most humans to get correct. Check server-side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as Google have determined that people in general manage to get them correct.

Identity Hash Code for short lower-case strings

1) I am going to be receiving sets of unique user ids, with between 1 and 300 entries in each set.
2) Each unique user id is 6-10 characters long, and all lower case a-z ASCII
3) I need to generate a unique identifier for each set, with a max length of 256 characters.
4) That unique identifier has, AFAIK, no bounds beyond "valid ASCII".
5) If two different sets generate the same unique group id, people in two unrelated groups will be able to see each others' information and I get to explain to Compliance why I thought that was a good idea. Let's call this the hashcode problem
6) If two identical sets don't produce the same id, a bunch of information will suddenly vanish for those users, and I will get to explain to Business why I thought that was a good idea. Let's call this the random guid problem.
Is there an identity hash that will meet my needs? Something like SHA-256 seems like it would be "safe enough", but given the limited input space and the fact that there's no problem with the hash being reversible (this is compression, not cryptography), it feels like there should be a compression function that could do the job.
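Whatever hash you pick, the step that actually prevents the random-guid problem is canonicalizing the set first: identical sets must serialize identically before hashing. After that, SHA-256 fits comfortably inside 256 characters. A sketch:

import hashlib

def group_id(user_ids: set) -> str:
    """Sort the ids and join with a separator that cannot appear in
    them (ids are lower-case a-z only), then hash the canonical form."""
    canonical = ",".join(sorted(user_ids))
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()  # 64 chars

assert group_id({"alice", "bob"}) == group_id({"bob", "alice"})

Note that a truly reversible compression can't work in general: 300 ids of up to 10 lower-case letters carry far more raw information than 256 ASCII characters can represent losslessly, so some hashing (with its negligible SHA-256 collision risk) is unavoidable.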

hashing strings

I have streaming strings (text containing words and numbers).
Taking one line at a time from the stream, I would like to assign a unique value to each.
Examples (strings with their scores/hashes):
User1 logged in Comp1 port8087 1109
User2 logged in comp2 1135
user3 logged in port8080 1098
user1 logged in comp2 port8080 1178
These strings should be in the same cluster. For this, what I have thought of is mapping (a loose kind of hashing) the strings such that a small change in the string won't affect the score much.
One simple way of doing that might be taking UliCp8, Ulic, ... (i.e., the 1st letter of each word) and finding some way of scoring it. Then similarly scored strings are kept in the same bucket and later sub-grouped.
An improved method would be: rather than taking the first letter of each word of the string, find some way to take a representative value of each word, such that the string's representation is suitable for mapping to a score/hash as mentioned.
Considering Levenshtein distance, the Jaccard index, or other similarity/distance metrics, all of them require inputting pairs of strings for comparison. Isn't there any method to hash/score a string as stated without doing pairwise comparisons? (POS tagging and pairwise comparison look inefficient for my purpose, as the data are streaming, huge in number, and unstructured.)
Hope you understand what I want to achieve; please help me out. Forget about the comments below and let's restart.
"at least two similar word (not considering length) should have similar hash value"
This is against the most basic requirement for a hash function: with a hash function, even minimal changes to the input should produce drastic changes to the bucket the hash falls into.
You are looking for an algorithm that calculates the similarity or distance between two inputs.
As stated, you are not looking for a hash function, but rather something like the Levenshtein distance, which calculates a metric representing the degree of difference between two sequences. It is commonly used to find out how similar or dissimilar two strings are. Hashing / message digests are good for creating identifiers for unique, distinct values, but they will produce entirely different results for "similar" values.
You are interested in the similarity of strings. Here is a nice post that names a few resources that are used for measuring string similarity. Maybe Lucene could help you in your situation.
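For reference, the Levenshtein distance mentioned above is only a few lines of dynamic programming (a plain sketch, not optimized for streaming volumes):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("User1 logged in Comp1 port8087",
                  "user1 logged in comp2 port8080"))  # 4: close, same cluster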

How can I generate a unique, small, random, and user-friendly key?

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32-bit id from the database. We then inserted it into the center bits of a 64-bit random integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9, skipping easily confused characters such as L, l, 1, O, 0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a GUID and looked random, even though it absolutely wasn't.
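A rough sketch of that scheme (the alphabet and bit layout below are approximated from the description, not the original code):

import secrets

# 54 easily-typed characters: 0, 1, I/i, L/l, and O/o are excluded.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"

def encode(seq_id: int) -> str:
    """Sequential 32-bit id in the center bits, random bits on each end."""
    value = ((secrets.randbits(16) << 48)
             | ((seq_id & 0xFFFFFFFF) << 16)
             | secrets.randbits(16))
    chars = []
    while value:
        value, r = divmod(value, 54)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars)) or ALPHABET[0]

print(encode(12345))  # looks random; up to 12 characters for a 64-bit value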
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like whether each is a noun or a verb). I think you can look around the intertubes for a copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so that obscure words are removed and words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I can also randomize the word-selection pattern between verbs/nouns, like
eatCake778
pickBasket524
rideFlyer113
etc..
The casing needn't be camelCase; you can randomize that as well. You can also randomize the placement of the number and of the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to see how often my algorithm collides (see the sketch below). If the collision rate was high, I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.).
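A sketch of that recipe, using a cryptographic RNG so the sequence isn't predictable (the word lists here are tiny placeholders; a real implementation would load the filtered frequency list described above):

import secrets

VERBS = ["eat", "pick", "ride", "send", "find"]       # placeholder lists; load
NOUNS = ["cake", "basket", "flyer", "stone", "lamp"]  # a filtered word list

def word_code() -> str:
    """verb + Noun + 3-digit number, e.g. 'pickBasket524'."""
    number = secrets.randbelow(900) + 100  # always exactly 3 digits
    return secrets.choice(VERBS) + secrets.choice(NOUNS).capitalize() + str(number)

# The collision test described above: generate a batch and count repeats.
batch = [word_code() for _ in range(10_000)]
print(len(batch) - len(set(batch)), "collisions in 10,000 codes")

With these toy lists the space is only 5 x 5 x 900 = 22,500 codes, so the test will report plenty of repeats; that's exactly the signal telling you to enlarge the word lists.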
In .NET you can use the RNGCryptoServiceProvider method GetBytes(), which will "fill an array of bytes with a cryptographically strong sequence of random values" (from the MS documentation).
using System.Security.Cryptography;

// Fill a 4-byte buffer with cryptographically strong random values.
byte[] randomBytes = new byte[4];
using var rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes);
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the System.IO.Path.GetRandomFileName() method... but I was generating salt for debug file names. This method returns strings that look like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
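As a small illustration of such a one-to-one mapping (a sketch; the alphabet is arbitrary), plain base conversion over a confusion-free alphabet already does it: distinct inputs give distinct keys, and the key decodes back to the input.

ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ"  # 31 symbols, no 0/O/1/I/L

def to_key(n: int) -> str:
    """Base-31 encoding: a one-to-one map from non-negative integers."""
    digits = []
    while True:
        n, r = divmod(n, 31)
        digits.append(ALPHABET[r])
        if n == 0:
            break
    return "".join(reversed(digits))

def from_key(key: str) -> int:
    n = 0
    for c in key:
        n = n * 31 + ALPHABET.index(c)
    return n

assert from_key(to_key(123456789)) == 123456789  # round-trips, hence unique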
If by user-friendly you mean that a user could type the answer in, then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers, as an easier and less error-prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use 64-bit encoded GUIDs.
You could load your list of words, as chakrit suggested, into a data table or XML file with a unique sequential key. When getting your random words, use a random number generator to determine which words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.