How can I create an arbitrary ranking of records in Postgres? - postgresql

The Problem
I'm looking to create a user-defined ranking of records in Postgres.
That is, the order in which the records are ranked is not defined by some underlying score but rather via the choices of a collection of users.
These choices are subject to frequent changes and the ranking will be constantly changing with both new records being added and existing records being moved to new positions.
For the sake of argument, assume that these operations occur with high frequency.
Furthermore, given an arbitrary subset of the records, we need to be able to determine how they should be ordered according to the ranking.
A Naive Solution
A very naive solution would be to track the ranking as an integer directly on the model and 'push' all the higher ranked records up by one when inserting a new record. This is obviously not ideal as we would need to modify potentially the entire table at once.
A Better Solution
Maintain a 'score' on each record in the interval [0, 1]. This can be indexed using a BTREE and used to rank the records. The first two records would have the scores 0 and 1. When inserting a new record some intermediate value would be chosen (e.g. 0.5) and the record inserted. This choice could be optimised in order to minimise the number of digits in the score.
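The midpoint idea described above can be sketched roughly as follows. This is a minimal in-memory illustration, not a Postgres implementation: the `scores` dict stands in for the table, and the names `score_between` and `rank` are hypothetical. It uses exact fractions because a fixed-width float would eventually run out of precision after enough insertions at the same spot.

```python
from fractions import Fraction

def score_between(lower, upper):
    """Return a score strictly between two neighbouring scores."""
    if not lower < upper:
        raise ValueError("lower must be strictly less than upper")
    return (lower + upper) / 2

# Hypothetical stand-in for the table: record name -> score in [0, 1].
scores = {"first": Fraction(0), "last": Fraction(1)}

# Insert a record between the two existing ones, then another above "first".
scores["middle"] = score_between(scores["first"], scores["last"])
scores["second"] = score_between(scores["first"], scores["middle"])

def rank(subset):
    """Order an arbitrary subset of records by their stored score."""
    return sorted(subset, key=scores.__getitem__)

ranking = rank(["last", "second", "first"])
```

Only the inserted record is written; no existing score changes, which is the advantage over the integer scheme.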
A Question
The above seems like a complex solution to a common problem. Furthermore, the problem is actually being solved by the underlying BTREE index, with the score serving as something of a hack to create the index.
Is there a neater way to solve the problem?

Related

How can I best construct data structures to retrieve similar values for demographic matching?

The job is person demographic matching/consolidation.
I have incoming person demographic information, and I need to determine whether it matches an existing person in a dataset. I get the following data:
NAME_LAST VARCHAR2(40),
NAME_FIRST VARCHAR2(40),
NAME_MIDDLE VARCHAR2(40),
NAME_MAIDEN VARCHAR2(40),
RESIDENCE_ADDRESS VARCHAR2(60),
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9),
RACE VARCHAR2(2),
DATE_OF_BIRTH DATE,
GENDER VARCHAR2(1),
TELEPHONE VARCHAR2(10),
SSN VARCHAR2(9)
The incoming and existing data can and does have typographic errors in any or all fields. I have written a probabilistic algorithm which takes an existing record and an incoming record and scores their similarity reasonably well (99.99%+).
The problem is performance. The match of two records is reasonably quick, but the dataset I need to match against currently has over 3.9 million rows. So obviously I can't try to match against all records in the dataset.
The common way around this is to limit searches using deterministic matches against limited subsets of the data (blocking). Soundex and double metaphone "hashing" are used on the name fields, and DOB is split into year and MMDD segments. This blocking yields good results, but unless I cast a wide net I miss some matches, and if I cast a wide net the performance degrades.
So the questions are:
What types of "hashing" can I do, other than double metaphone & soundex, on the data elements which would be suitable for exact or range matching which would yield small subsets of data likely to contain the "best" match?
Is there a better approach to creating a suitable data structure for matching?
The data is held in an Oracle 19c database; the main language at my disposal is PL/SQL.
You should either add your algorithm that produces a reasonable score, or add more information about the input you are matching against.
For example:
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9)
These should either contain no errors, or contain errors that are much easier to detect and correct.
In that case you can create an index on these three columns and run your algorithm only on the rows that match them exactly (or match after correction).
So my suggestion would be to divide the original data into smaller groups that can be matched more precisely, and then run your algorithm within each smaller group.
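As a rough illustration of dividing the data into smaller groups, here is a minimal Python sketch (the question's environment is PL/SQL, so this only shows the idea). It assumes several narrow grouping keys are built per record and the candidates are the union of the matching groups, so that a typo in one field does not hide a record. The sample rows, the key choices and the names `blocking_keys` and `candidates` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical rows; field names follow the column list in the question.
existing = [
    {"id": 1, "NAME_LAST": "SMITH", "DATE_OF_BIRTH": "1980-03-14", "RESIDENCE_ZIP": "12345"},
    {"id": 2, "NAME_LAST": "SMYTH", "DATE_OF_BIRTH": "1980-03-14", "RESIDENCE_ZIP": "12345"},
    {"id": 3, "NAME_LAST": "JONES", "DATE_OF_BIRTH": "1975-11-02", "RESIDENCE_ZIP": "54321"},
]

def blocking_keys(rec):
    """Yield several narrow grouping keys for one record (assumed choices)."""
    dob = rec["DATE_OF_BIRTH"]
    yield ("dob", dob)
    yield ("zip+year", rec["RESIDENCE_ZIP"][:5], dob[:4])
    yield ("last3+mmdd", rec["NAME_LAST"][:3], dob[5:])

# Build the groups once: key -> ids of existing records sharing that key.
index = defaultdict(set)
for rec in existing:
    for key in blocking_keys(rec):
        index[key].add(rec["id"])

def candidates(incoming):
    """Union of all groups the incoming record falls into."""
    found = set()
    for key in blocking_keys(incoming):
        found |= index.get(key, set())
    return found

# The ZIP has two digits transposed, but the DOB key still finds both Smiths.
incoming = {"NAME_LAST": "SMITH", "DATE_OF_BIRTH": "1980-03-14", "RESIDENCE_ZIP": "12354"}
candidate_ids = candidates(incoming)
```

The expensive probabilistic score would then be computed only for `candidate_ids` rather than for every row.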

PostgreSQL - Compare ts_vector fields

I have two tables holding data from two different sources. One of the fields in each table contains the title of a movie but, for reasons outside my control, the titles are not always exactly the same.
So I use tsvector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvector values without taking the numeric values into account, looking only at the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
You could create an index on the stripped vector:
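To make the role of strip() concrete, here is a small Python model of the comparison. It is a simplification, not how Postgres stores a tsvector: each vector is assumed to be a mapping from lexeme to positions, the sample titles are hypothetical, and the local `strip` function only mimics the effect of the real one.

```python
# Simplified model of a tsvector: lexeme -> list of word positions.
title_a = {"dark": [2], "knight": [3]}  # e.g. from "The Dark Knight"
title_b = {"dark": [1], "knight": [2]}  # e.g. from "Dark Knights"

def strip(vector):
    """Keep only the lexemes, discarding positions (mimics strip())."""
    return frozenset(vector)

# Direct comparison fails because the positions differ.
exact_equal = title_a == title_b

# Comparison after stripping succeeds because the lexemes are the same.
stripped_equal = strip(title_a) == strip(title_b)
```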
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.

Sphinx centralize multiple tables into a single index

I have multiple MySQL tables and I want a single index covering all of them.
Each table has an int autoincrement primary key.
The structure of the collected data is the same for each table (so no conflict there), but the IDs collide, so it seems I have to query each index separately (unless you can give me a hint on how to avoid the ID collision).
The question is: if I query each index separately, does that mean the weights of the returned results are comparable between indexes?
unless you can give me a hint of how to avoid ID collision
See for example
http://sphinxsearch.com/forum/view.html?id=13078
You can just arrange for the ids to be offset differently. The 'sphinx document id' doesn't have to match the real primary key, but having a simple mapping makes the application simpler.
You have a choice between one-index, one-source (a single SQL query that unions all the tables together); one-index, many-source (a source per table, all feeding one index); or many-indexes (one index per table, each with its own source). Whichever way you choose, the query results will be the same.
If I query each index separately does it means that the weight of returned results are comparable between indexes?
Pretty much. The difference should be negligible, so it doesn't matter which way round you do it.
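One possible "simple mapping" is to interleave the ids so that each table owns its own residue class. The sketch below is only an assumption about how such a mapping could look; the table names and the helpers `to_doc_id` and `from_doc_id` are hypothetical, and the same arithmetic would normally live in each source's SQL query.

```python
# Hypothetical table names; their order must stay fixed once indexed.
TABLES = ["articles", "comments", "pages"]

def to_doc_id(table, pk):
    """Map (table, primary key) to a document id unique across tables."""
    return pk * len(TABLES) + TABLES.index(table)

def from_doc_id(doc_id):
    """Recover (table, primary key) from a document id."""
    pk, table_index = divmod(doc_id, len(TABLES))
    return TABLES[table_index], pk

doc_id = to_doc_id("comments", 7)
origin = from_doc_id(doc_id)
```

Adding a table later changes `len(TABLES)` and therefore every id, so a fixed multiplier larger than the expected number of tables may be the safer choice.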

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes on array elements:
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference between queries using such an index and queries using a regular index on a column?
As far as I've read, this performance difference would come into picture in arrays with variable length elements; but I'm not sure about fixed length ones.
Don't take the lazy way out.
Storing 100 or more values as an array is fine if an array makes sense for your application and your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performance, and you should use columns. That will also help when you have to delete a "column" in the middle or redesign the table.
In any case, as Frank wrote in the comments, if the values are all of the same type (and also have the same meaning), consider modelling them in a separate table.

How to avoid column name conflicts in cassandra

I need to store a list of user names in a Cassandra column family(wide row/dynamic columns).
The columnname/comparator type will be integer, so as to sort the users based on a score.
The score ranges from 0 to 100. The problem is: if two or more users have the same score, how can I store them in different columns? Cassandra would not allow duplicate column names.
Is there any way to convert integer to timeuuids? Or any other solution for this problem?
This is a problem I have seen quite often (not scores specifically, but preventing column name conflicts). In general the solution is some form of concatenating a UUID to the column name, since UUIDs are designed never to conflict.
If you want to keep sorting by score, then I advise you to use a CompositeType column name.
More specifically:
CompositeType(score: Integer | time: TimeUUID)
The comparator in Cassandra will then first sort by score and then by time (putting the most recent last I believe).
TimeUUID should also take care of "simultaneous" score postings, even though the probability of that happening with a Long timestamp would be ridiculously low.
You can use the built-in list feature, see http://www.datastax.com/dev/blog/cql3_collections
Just have a column for each value and a list of users for that value.
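The effect of the composite column name can be imitated in plain Python. This sketch does not use a Cassandra driver: the `columns` dict stands in for one wide row, `add_user` is a hypothetical helper, and Python's `uuid.uuid1()` plays the part of the TimeUUID component.

```python
import uuid

# Stand-in for one wide row: (score, time-based UUID) -> user name.
columns = {}

def add_user(score, name):
    """Store a user under a composite key so equal scores never collide."""
    columns[(score, uuid.uuid1())] = name

add_user(90, "alice")
add_user(75, "carol")
add_user(90, "bob")  # same score as alice, but a different column name

# Sort by score first, then by the UUID's embedded timestamp.
ordered = [
    name
    for key, name in sorted(
        columns.items(), key=lambda item: (item[0][0], item[0][1].time)
    )
]
```

Both users with score 90 survive as separate entries, and they still sort after the user with score 75.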