How can I best construct data structures to retrieve similar values for demographic matching?

The job is person demographic matching/consolidation.
I have incoming person demographic information, and I need to determine whether it matches an existing person in a dataset. I get the following data:
NAME_LAST VARCHAR2(40),
NAME_FIRST VARCHAR2(40),
NAME_MIDDLE VARCHAR2(40),
NAME_MAIDEN VARCHAR2(40),
RESIDENCE_ADDRESS VARCHAR2(60),
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9),
RACE VARCHAR2(2),
DATE_OF_BIRTH DATE,
GENDER VARCHAR2(1),
TELEPHONE VARCHAR2(10),
SSN VARCHAR2(9)
The incoming and existing data can and does contain typographic errors in any or all fields. I have written a probabilistic algorithm which takes an existing record and an incoming record and scores their similarity reasonably well (99.99%+).
The problem is performance. Matching two records is reasonably quick, but the dataset I need to match against currently has over 3.9 million rows, so obviously I can't try to match against every record in the dataset.
The common way around this is to limit searches using deterministic matches against limited subsets of the data (blocking). Soundex and double metaphone "hashing" are used on the name fields, and DOB is split into year and MMDD segments. This blocking yields good results, but unless I cast a wide net I miss some matches, and if I cast a wide net the performance degrades.
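For reference, the current blocking looks roughly like the following (the function-based indexes and bind variable names are illustrative only; the double metaphone hashing is done in a PL/SQL function not shown here):
-- Illustrative blocking keys (names are hypothetical).
CREATE INDEX person_lname_sdx_ix ON person (SOUNDEX(name_last));
CREATE INDEX person_dob_year_ix  ON person (EXTRACT(YEAR FROM date_of_birth));
CREATE INDEX person_dob_mmdd_ix  ON person (TO_CHAR(date_of_birth, 'MMDD'));

-- Candidate selection: any row sharing at least one blocking key with the
-- incoming record; the candidates are then passed to the probabilistic scorer.
SELECT p.*
FROM   person p
WHERE  SOUNDEX(p.name_last) = SOUNDEX(:in_name_last)
   OR  EXTRACT(YEAR FROM p.date_of_birth) = EXTRACT(YEAR FROM :in_dob)
   OR  TO_CHAR(p.date_of_birth, 'MMDD') = TO_CHAR(:in_dob, 'MMDD');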
So the questions are:
What types of "hashing" can I do, other than double metaphone & soundex, on the data elements which would be suitable for exact or range matching which would yield small subsets of data likely to contain the "best" match?
Is there a better approach to creating a suitable data structure for matching?
The data is contained in an Oracle 19c database, and the main language at my disposal is PL/SQL.

You should either post the algorithm that produces your similarity score, or add more information about the input you need to match against.
For example:
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9)
should either contain no errors, or any errors in them can be detected and corrected much more easily.
In that case you can create an index on these three columns and run your algorithm only on the rows that match them exactly (or match after correction).
So my suggestion would be to divide the original data into smaller groups that can be matched more precisely, and then run your algorithm within each smaller group.
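As a sketch (table and bind variable names are hypothetical), the idea in Oracle terms would be something like:
-- Index the blocking columns that are expected to be (nearly) error-free:
CREATE INDEX person_addr_blk_ix
    ON person (residence_state, residence_zip, residence_city);

-- Select only the rows in the same block as the (possibly corrected) input,
-- then feed this much smaller set into the probabilistic scoring routine:
SELECT p.*
FROM   person p
WHERE  p.residence_state = :in_state
AND    p.residence_zip   = :in_zip
AND    p.residence_city  = :in_city;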

Related

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user  timestamp            event_type
1     2021-01-01 12:00:00  foo
1     2021-01-01 15:00:00  bar
2     2021-01-01 16:00:00  foo
2     2021-01-01 19:00:00  foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendy" approach to this problem would simply be to have an aggregate column for each event type:
user  nb_events  ...  nb_foo  nb_bar
1     2          ...  1       1
2     2          ...  2       0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1600 columns). Moreover, there may be multiple types of aggregations on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user  nb_events  ...  count_by_event_type (SUPER)
1     2          ...  {"foo": 1, "bar": 1}
2     2          ...  {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? All Redshift docs only really discuss SUPER columns in the context of initial data load (e.g. by using json_parse), but never discuss the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data but convert it to columnar data as soon as possible.
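For concreteness, the closest thing I can imagine writing is something along these lines, building the JSON text by hand and then converting it with json_parse (table and column names are simplified placeholders, and this assumes the event_type values need no JSON escaping):
-- Hypothetical sketch: aggregate per (user, event_type), assemble a JSON
-- string per user, then convert that string to SUPER with json_parse.
with per_type as (
    select user_id, event_type, count(*) as cnt
    from   events
    group  by user_id, event_type
),
as_json_text as (
    select user_id,
           sum(cnt) as nb_events,
           '{' || listagg('"' || event_type || '":' || cnt::text, ',') || '}' as j
    from   per_type
    group  by user_id
)
select user_id, nb_events, json_parse(j) as count_by_event_type
from   as_json_text;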
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events  ...  count_by_event_type (SUPER)
4          ...  {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) as nb_events,
    (
        select listagg(s)
        from (
            select k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                 unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) as count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

How can I create an arbitrary ranking of records in Postgres?

The Problem
I'm looking to create a user defined ranking of records in Postgres.
That is, the order in which the records are ranked is not defined by some underlying score but rather via the choices of a collection of users.
These choices are subject to frequent changes and the ranking will be constantly changing with both new records being added and existing records being moved to new positions.
For the sake of argument, assume that these operations occur with high frequency.
Furthermore, given an arbitrary subset of all records, we need to be able to determine how they should be ordered according to the ranking.
A Naive Solution
A very naive solution would be to track the ranking as an integer directly on the model and 'push' all higher-ranked records up by one when inserting a new record. This is obviously not ideal, as we would potentially need to modify the entire table at once.
A Better Solution
Maintain a 'score' on each record in the interval [0, 1]. This can be indexed using a BTREE and used to rank the records. The first two records would have the scores 0 and 1. When inserting a new record some intermediate value would be chosen (e.g. 0.5) and the record inserted. This choice could be optimised in order to minimise the number of digits in the score.
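A minimal sketch of that idea (table and column names are made up):
-- Hypothetical table: the user-defined order lives entirely in "score".
create table ranked_items (
    id    bigserial primary key,
    score numeric not null unique     -- kept in [0, 1]; UNIQUE provides the btree
);

-- The first two records anchor the ends of the interval.
insert into ranked_items (score) values (0), (1);

-- To place a new record between two neighbours, insert the midpoint of their scores.
insert into ranked_items (score) values ((0 + 1) / 2.0);   -- i.e. 0.5

-- Ordering an arbitrary subset according to the ranking is then just:
select id
from   ranked_items
where  id in (3, 1, 2)                -- arbitrary subset
order  by score;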
A Question
The above seems like a complex solution to a common problem. Furthermore, the problem is really being solved by the underlying BTREE index, with the score being something of a hack to create that index.
Is there a neater way to solve the problem?

PostgreSQL - Compare ts_vector fields

I have two tables in which I have data coming from two different sources. One of the fields in each table contains the title of a movie, but for reasons out of my control the titles are not always exactly the same.
So I use ts_vector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two ts_vector fields without taking the numeric position values into account, just the text content. If I compare the two fields directly, I only get exact matches, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare ts_vectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
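The comparison itself would then just be an equality join on the stripped vectors, along the lines of:
-- Sketch: match rows whose titles reduce to the same lexemes, ignoring
-- positions and weights (tbl1/tbl2/ts_title as in the index example above).
select t1.*, t2.*
from   tbl1 t1
join   tbl2 t2 on strip(t1.ts_title) = strip(t2.ts_title);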

Is there any way for Access 2016 to sort the numbers that are part of a "text" data type formatted field as though they are numeric values?

I am working on a database that (hopefully) will end up using a primary key with both numbers and letters in its values to track lots of agricultural product. Due to the way in which the weighing of product takes place at more than one facility, I have no option but to keep the same base number and append letters to it to denote split portions of each lot of product. The problem is that after I create record number 99, the number 100 suddenly sorts up underneath 10. This makes it difficult to maintain consistency and forces me to replace the alphanumeric lot ID with a strictly numeric value (using "autonumber" as the data type) just to keep things sorted. Either way, I need the alphanumeric lot ID, and having two IDs for the same lot can be confusing for anyone entering values into the form. Is there a way around this that I am just not seeing?
If you're using a query as the data source, you may try sorting by the string converted to a number, something like (placeholder table and field names):
SELECT id, field1, field2, ..
FROM YourTable
ORDER BY CLng(YourAlphaNumericField)
Edit: you may also try the Val function instead of CLng; it should not fail on non-numeric input.
Why not properly format your key before saving, e.g. "0000099"? You will avoid a costly conversion later.
Alternatively, you could use 2 fields as the composite PK. One with the Number (as Long) and one with the Location (as String).
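Combining the two suggestions above, a sketch (table and field names are made up): keep the numeric base and the letter suffix in separate fields, sort on them, and build the display ID in the query with Format.
SELECT Lots.*, Format([LotNumber], "0000000") & [LotSuffix] AS LotID
FROM Lots
ORDER BY [LotNumber], [LotSuffix];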

How to avoid column name conflicts in cassandra

I need to store a list of user names in a Cassandra column family (wide row/dynamic columns).
The column name/comparator type will be integer, so as to sort the users based on a score.
The score ranges from 0 to 100. The problem is, if two or more users have the same score, how can I store them in different columns? Cassandra would not allow that...
Is there any way to convert an integer to a timeuuid? Or is there any other solution to this problem?
This is a problem I have seen quite often (not scores specifically, but preventing column name conflicts). In general the solution is some form of concatenating a UUID to the column name (since UUIDs are made to never conflict).
If you want to keep sorting by score then I advise you to use a CompositeType column name.
More specifically:
CompositeType(score: Integer | time: TimeUUID)
The comparator in Cassandra will then sort first by score and then by time (putting the most recent last, I believe).
The TimeUUID should also take care of "simultaneous" score postings, even though the probability of a collision with a plain Long timestamp would already be ridiculously low.
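In CQL 3 terms, the same layout would look roughly like this (table and column names are made up); the clustering columns give you the sort by score, and the timeuuid breaks ties:
-- Hypothetical CQL 3 equivalent of the composite column approach:
CREATE TABLE scores_by_board (
    board    text,
    score    int,
    posted   timeuuid,
    username text,
    PRIMARY KEY (board, score, posted)
) WITH CLUSTERING ORDER BY (score ASC, posted ASC);

-- Reading users ordered by score (ties resolved by the timeuuid):
SELECT username, score FROM scores_by_board WHERE board = 'default';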
You can use the built-in list feature, see http://www.datastax.com/dev/blog/cql3_collections
Just have a column with the score value and a list of users for that value.
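A sketch of that collection approach (names made up):
-- One row per score value, with the users holding that score kept in a list:
CREATE TABLE users_by_score (
    score int PRIMARY KEY,
    users list<text>
);

-- Appending a user to a given score:
UPDATE users_by_score SET users = users + ['alice'] WHERE score = 87;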