Compute "substring" distances between sequences

Compute "substring" distances between sequences - traminer

My dataset (first line = header) is the following:
ID;Activity 1;Activity 2; ... ;Activity 20;
Company_X;A1A3T1D1O1R1R8;A1A3T2O1R2;...;A1A3T6D2O1O2R2
Company_Y;A1A3T1O1R1;A1A3T2O1R2;...;A1A3T11O1O3R5
Company Z;A1A3T1D8O1R1R8;A1A3T2O1R2;...;A1A3T6D2O1R2
where for each activity, each pair (one letter + one number) represents on part of a sequence. A1=actor1, A3=actor3, O1=object1. What I try to do is to compute the difference between the activities of companies. For instance the activity1 of company_x should have a difference of - e.g., 2 with the activity1 of company_y since they have in common A1A3T1O1R1 but not D1 and R8.
Can any packages in TraMineR do that? Which means comparing, within each event, a predefined number of chars?
Thank you very much for your help

From what I understand, each string (activity) like A1A3T6D2O1O2R2 should be considered as a sequence of pairs and you want to compare such sequences.
The seqdef function of TraMineR can read sequences in string form. However, when each element is defined by more than a single character, you have to introduce a separator (e.g., A1-A3-T6) for that. Then, to pair your sequences with company names you may also need to organize your data in table form with each sequence (activity) in a separate row, something like
ID Activity
company_x A1-A3-T6-D2-O1-O2-R2
company_y A1-A3-T1-O1-R1
...
Then, you can compute dissimilarities using measures applicable to sequences of different lengths. Optimal matching (OM), for instance, is the minimal cost of transforming one sequence into the other given the indel and substitution costs. This should give you what you expect. Depending on the substitution costs, the distance between A1A3T6D2O1O2R2 and A1A3T6D2O1R2, could be different than between A1A3T6D2O1O2R2 and A3T4

Related

How can I best construct data structures to retrieve similar values for demographic matching?

The job is person demographic matching/consolidation.
I have incoming person demographic information which I need to determine if it is a match against an existing person in the a dataset. I get the following data;
NAME_LAST VARCHAR2(40),
NAME_FIRST VARCHAR2(40),
NAME_MIDDLE VARCHAR2(40),
NAME_MAIDEN VARCHAR2(40),
RESIDENCE_ADDRESS VARCHAR2(60),
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9),
RACE VARCHAR2(2),
DATE_OF_BIRTH DATE,
GENDER VARCHAR2(1),
TELEPHONE VARCHAR2(10),
SSN VARCHAR2(9)
The incoming and existing data can and does have typographic errors in any/all fields. I have written a probabilistic algorithm which will take an existing record, incoming record and score their similarity reasonably well (99.99%+).
The problem is performance. The match of two records is reasonably quick, but the dataset I need to match against currently has over 3.9 million rows. So obviously I can't try to match against all records in the dataset.
The common way around this is to limit searches using deterministic matches against limited subsets of the data (blocking). Soundex and double metaphone "hashing" is used on name fields, DOB is split into year and MMDD segments, and this blocking yields good results but unless I cast a wide net, I miss some matches. If I cast a wide net, the performance degrades.
So the questions are;
What types of "hashing" can I do, other than double metaphone & soundex, on the data elements which would be suitable for exact or range matching which would yield small subsets of data likely to contain the "best" match?
Is there a better approach to creating a suitable data structure for matching?
The data is contained in an Oracle DB 19c the main language at my disposal is PL/SQL.

You should either add your algorithm that makes a reasonable score or add additional information - against what input you should match.
For example:
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9)
Should either not contains errors or those errors could be much easier detected and corrected.
In this case you can create index on these three columns and run your algorithm on those that matches exact (or matches after correction) these three columns.
So my suggestion would be - to divide original data on smaller groups that can be matched more precisly and then run you algorithm based on this smaller group.

postgres partiton char - values from to inclusive

Hello I have a tables called agents thats partitoned on name
Now I want to create a horizontal partitoning for names starting from g to z.
the problem that when I do like the below code piece names like 'zizo' dont find a table since the to statement is exclusive.
Also How to make it case insensitive??
CREATE TABLE agents_gz
PARTITION OF agents
FOR VALUES FROM ('^g') TO ('^z');

You could define the partition like this:
CREATE TABLE agents_gz
PARTITION OF agents
FOR VALUES FROM ('g') TO (MAXVALUE);
Specifying '^g' will only work accidentally, because many collations ignore special characters when comparing strings. But using 'g' as the lower bound is better.
For the upper bound you can use MAXVALUE, which means that the upper bound is defined as the maximum possible string.

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

I am working on a database that (hopefully) will end up using a primary key with both numbers and letters in the values to track lots of agricultural product. Due to the way in which the weighing of product takes place at more than one facility, I have no other option but to maintain the same base number but use letters in addition to this base number to denote split portions of each lot of product. The problem is, after I create record number 99, the number 100 suddenly floats up and underneath 10. This makes it difficult to maintain consistency and forces me to replace this alphanumeric lot ID with a strictly numeric value in order to keep it sorted (which I use "autonumber" as the data type). Either way, I need the alphanumeric lot ID, and so having 2 ID's for the same lot can be confusing for anyone inputting values into the form. Is there a way around this that I am just not seeing?

If you're using query as a data source then you may try to sort it by string converted to number, something like
SELECT id, field1, field2, ..
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try Val function instead of CLng - it should not fail on non-numeric input

Why not properly format your key before saving ? e.g: "0000099". You will avoid a costly conversion later.
Alternatively, you could use 2 fields as the composite PK. One with the Number (as Long) and one with the Location (as String).

How does the hash part in hash maps work?

So there is this nice picture in the hash maps article on Wikipedia:
Everything clear so far, except for the hash function in the middle.
How can a function generate the right index from any string? Are the indexes integers in reality too? If yes, how can the function output 1 for John Smith, 2 for Lisa Smith, etc.?

That's one of the key problems of hashmaps/dictionaries and so on. You have to choose a good hash function. A very bad but fast hash function could be the length of the keys. You instantly see, that you will get a lot of collisions (different keys, but same hash). Another bad hash function could be the ASCII value of the first character of your key. Lot's of collisions, too.
So you need a function that is a lot better than those two. You could add (xor) all ASCII values of the key characters and mix the length in for instance. In practice you often depend on the values (fields) of the object that you want to hash (same values give same hash => value type). For reference types you can mix in a memory location for instance.
In your example that's just simplified a lot. No real hash function would map these keys to sequential numbers.
Maybe you want to read one of my previous answers to hashmaps

A simple hash function may be as follows:
$hash = $string[0] % HASH_TABLE_SIZE;
This function will return a number between 0 and HASH_TABLE_SIZE - 1, depending on the first letter of the string. This number can be used to go to the correct position in the hash table.
A real hash function will consider all letters in a string, and it will be designed so that there is an even spread among the buckets.

The hash function most often (but not necessarily always) outputs an integer within wanted range (often parameter to the hash function). This integer can be used as an index. Notice that hash function cannot be guaranteed to always produce unique result when given different data to hash. This is called hash collision and hash algorithm must always handle it in some way.
As for your specific question, how a string becomes a number. Any string is composed of characters (J, o, h, n ...) and characters can be interpreted as numbers (in computers). ASCII and UTF standards bind certain values to certain characters, so result is deterministic and always the same on all computers. So the hash function does operation on these characters that processes them as numbers and comes up with another number (output). You could for example simply sum all the values and use modulo operation to range-limit the resulting value.
This would be quite a horrible hashing function because for example "ab" and "ba" would get same result. Design of hash function is difficult and so one should use some ready-made algorithm unless situation dictates some other solution.

There's a really good article on how hash functions (and colision detection/resolution) on MSDN:
Part 2: The Queue, Stack, and Hashtable
You can skip down to the header Compressing Ordinal Indexing with a Hash Function
There are some bits and pieces that are .NET specific (when they talk about which Hash algorithm .NET uses by default) but for the most part it is language agnostic.

All that is required of a hash function is that it returns the same integer given the same key. Technically, a hash function that always returns '1' is not incorrect.

Pseudorandom seed methodology for lookup tables

Could someone please suggest a good way of taking a global seed value e.g. "Hello World" and using that value to lookup values in arrays or tables.
I'm sort of thinking like that classic spacefaring game of "Elite" where there were different attributes for the planets but they were not random, simply derived from the seed value for the universe.
I was thinking MD5 on the input value and then using bytes from the hash, casting them to integers and mod them into acceptable indexes for lookup tables, but i suspect there must be a better way? I read something about Mersenne twisters but maybe that would be overkill.
I'm hoping for something which will give a good distrubution over the values in my lookup tables. e.g. Red, Orange, Yellow, Green, Blue, Purple
Also to emphasize I'm not looking for random values but consistent values each time.
Update: Perhaps I'm having difficulty in expressing my own problem domain. Here is an example of a site which uses generators and can generate X number of values: http://www.seventhsanctum.com
Additional criteria
I would prefer to work from first principles rather than making use of library functions such as System.Random

My approach would be to use your key as a seed for a random number generator
public StarSystem(long systemSeed){
java.util.Random r = new Random(systemSeed);
Color c = colorArray[r.nextInt(colorArray.length)]; // generates a psudo-random-number based from your seed
PoliticalSystem politics = politicsArray[r.nextInt(politicsArray.length)];
...
}
For a given seed this will produce the same color and the same political system every time.
For getting the starting seed from a string you could just use MD5Sum and grab the first/last 64bits for your long, the other approach would be to just use a numeric for each plant. Elite also generated the names for each system using its pseudo-random-generator.
for(long seed=1; seed<NUMBER_OF_SYSTEMS; seed++){
starSystems.add(new StarSystem(seed));
}
By setting the seed to a known value each time the Random will return the same sequence every time it is called, this is why when trying for good random values a good seed is very important. However in your case a known seed will produce the results your looking for.
The c# equivalent is
public StarSystem(int systemSeed){
System.Random r = new Random(systemSeed);
Color c = colorArray[r.next(colorArray.length)]; // generates a psudo-random-number based from your seed
PoliticalSystem politics = politicsArray[r.next(politicsArray.length)];
...
}
Notice a difference? no, nor did I.

Many common random number generators will generate the same sequence given the same seed value, so it seems that all you need to do is convert your name into a number. There are any number of hashing functions that will do that.
Supplementary question: Is it required that all unique strings generate unique hashes and so (probably) unique pseudo-random sequences.
?

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Compute "substring" distances between sequences - traminer

Related

How can I best construct data structures to retrieve similar values for demographic matching?

postgres partiton char - values from to inclusive

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

How does the hash part in hash maps work?

Pseudorandom seed methodology for lookup tables

Categories

Resources