Sphinx - weigh results by the value of a field

I have a large table with the columns name, phone number, and type of relationship to me (friend, family, acquaintance, etc.). When I search a name in Sphinx I want the results with the field value "family" to be weighted higher than "acquaintance." How do I manually set the weight of a certain row so that it is weighted higher?

Many options...
A) You could store the 'type' in an attribute - attributes are stored in the index (unlike fields) and can be used to sort the results.
How exactly you do it, though, is largely a matter of personal preference.
For example:
sql_query = SELECT id, name, phone, type, (type='family') AS boost FROM table
sql_attr_bool = boost
Then queries can be sorted by the boost attribute: ... ORDER BY boost DESC, WEIGHT() DESC
You could also just store the type as a plain integer, with a value for each type, set up in such a way that you can naturally sort by that column.
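For example, a minimal sketch of that integer variant (table, index, and attribute names are illustrative, not from the question):
sql_query = SELECT id, name, phone, CASE type WHEN 'family' THEN 3 WHEN 'friend' THEN 2 ELSE 1 END AS type_rank FROM contacts
sql_attr_uint = type_rank
Then in SphinxQL you can sort on it directly:
SELECT id, name, phone, WEIGHT() FROM contacts_index WHERE MATCH('john') ORDER BY type_rank DESC, WEIGHT() DESC;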
Or B) you can actually just leave the type as a field and boost the results with an extended query, for example:
john | (john @type family)
You need john on both sides of the OR so that you always get rows that include john. But because it is on both sides, results that also match family will match more keywords and rank higher.
You can choose how much it affects the results using the field-weights option.
(The latest versions of Sphinx have a MAYBE operator that makes this even easier.)
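Roughly, with illustrative index/field names (the exact extended-query syntax should be checked against the Sphinx docs for your version):
SELECT id, name, phone, WEIGHT() FROM contacts_index WHERE MATCH('john | (john @type family)') ORDER BY WEIGHT() DESC OPTION field_weights=(name=10, type=3);
-- on newer Sphinx, the MAYBE operator expresses the same intent more directly:
SELECT id, name, phone, WEIGHT() FROM contacts_index WHERE MATCH('john MAYBE (@type family)') ORDER BY WEIGHT() DESC;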

Related

How can I best construct data structures to retrieve similar values for demographic matching?

The job is person demographic matching/consolidation.
I have incoming person demographic information which I need to check for a match against an existing person in the dataset. I get the following data:
NAME_LAST VARCHAR2(40),
NAME_FIRST VARCHAR2(40),
NAME_MIDDLE VARCHAR2(40),
NAME_MAIDEN VARCHAR2(40),
RESIDENCE_ADDRESS VARCHAR2(60),
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9),
RACE VARCHAR2(2),
DATE_OF_BIRTH DATE,
GENDER VARCHAR2(1),
TELEPHONE VARCHAR2(10),
SSN VARCHAR2(9)
The incoming and existing data can and does have typographic errors in any/all fields. I have written a probabilistic algorithm which will take an existing record, incoming record and score their similarity reasonably well (99.99%+).
The problem is performance. The match of two records is reasonably quick, but the dataset I need to match against currently has over 3.9 million rows. So obviously I can't try to match against all records in the dataset.
The common way around this is to limit searches using deterministic matches against limited subsets of the data (blocking). Soundex and double metaphone "hashing" are used on the name fields, DOB is split into year and MMDD segments, and this blocking yields good results, but unless I cast a wide net I miss some matches. If I cast a wide net, performance degrades.
So the questions are:
What types of "hashing", other than double metaphone and soundex, can I apply to the data elements that would be suitable for exact or range matching and would yield small subsets of data likely to contain the "best" match?
Is there a better approach to creating a suitable data structure for matching?
The data is contained in an Oracle 19c database, and the main language at my disposal is PL/SQL.
You should either post the algorithm that produces your score or add more information about what input you are matching against.
For example:
RESIDENCE_CITY VARCHAR2(50),
RESIDENCE_STATE VARCHAR2(2),
RESIDENCE_ZIP VARCHAR2(9)
should either contain no errors, or their errors can be detected and corrected much more easily.
In that case you can create an index on these three columns and run your algorithm only on the rows that match them exactly (or match after correction).
So my suggestion would be to divide the original data into smaller groups that can be matched more precisely, and then run your algorithm against each smaller group.
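A minimal SQL sketch of that blocking idea (table, index, and bind-variable names are illustrative, not from the question): index the address block, pull only the rows that agree on it, and feed just those candidates to the scoring algorithm.
CREATE INDEX person_addr_block_idx ON person_demographics (residence_zip, residence_state, residence_city);
SELECT p.* FROM person_demographics p WHERE p.residence_zip = :in_zip AND p.residence_state = :in_state AND p.residence_city = :in_city;
-- score only the candidate rows returned above with the probabilistic algorithm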

Better Postgres trigram ranking

I'm searching several million names and addresses in a Postgres table. I'd like to use pg_trgm to do fast fuzzy search.
My app is actually very similar to the one in Optimizing a postgres similarity query (pg_trgm + gin index), and the answer there is pretty good.
My problem is that the relevance ranking isn't very good. There are two issues:
I want names to get a heavier weight in the ranking than addresses, and it's not clear how to do that and still get good performance. For example, if a user searches for 'smith', I want 'Bob Smith' to appear higher in the results than '123 Smith Street'.
The current results are biased toward columns that contain fewer characters. For example, a search for 'bob' will rank 'Bobby Smith' (without an address) above 'Bob Smith, 123 Bob Street, Smithville Illinois, 12345 with some other info here'. The reason for this is that the similarity score penalizes for parts of the string that do not match the search terms.
I'm thinking that I'll get a much better result if I could get a score that simply returns the number of matched trigrams in a record, not the number of trigrams scaled by the length of the target string. That's the way most search engines (like Elastic) work -- they rank by the weighted number of hits and do not penalize long documents.
Is it possible to do this with pg_trgm AND get good (sub-second) performance? I could do an arbitrary ranking of results, but if the ORDER BY clause does not match the index, then performance will be poor.
I know that this is an old question but this might be useful for others.
If the text you want to search falls within the ASCII range (characters in [a-zA-Z0-9] and some other symbols), then you probably want to use the Full Text Search feature (see the official documentation on Full Text Search).
Not only does it give you the ability to sort by relevancy, it also lets you customize things like stemming (using the Snowball algorithm), which maps words like connection, connections, connective, connected, and connecting to connect (read more about Snowball stemming). This makes your application's search perform better.
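As a minimal sketch (the table and column names my_table / col are assumptions for illustration), an expression index plus ts_rank() gives stemmed matching with relevance ordering:
CREATE INDEX my_table_col_fts_idx ON my_table USING gin (to_tsvector('english', col));
SELECT col, ts_rank(to_tsvector('english', col), plainto_tsquery('english', 'connecting cats')) AS rank
FROM my_table
WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'connecting cats')
ORDER BY rank DESC;
Here a search for 'connecting' also finds rows containing 'connected' or 'connection', because both sides are stemmed to 'connect'.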
But if your requirement is to search text outside the ASCII range, such as Unicode text, which is common if you need to support Asian languages like Japanese, Thai, or Korean, then using pg_trgm is perfectly fine.
To make the search not biased toward shorter text, as mentioned in the question, you could use word_similarity() instead of similarity().
As per the official documentation:
word_similarity( text, text )
Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below.
So for example:
postgres=# SELECT word_similarity('white cat', 'white dog and black cat') as "similarity 1", word_similarity('white cat', 'I have a white dog and a black cat') as "similarity 2", word_similarity('white cat', 'I have a lovely white dog and a cute big black cat in a house') as "similarity 3";
 similarity 1 | similarity 2 | similarity 3
--------------+--------------+--------------
          0.6 |          0.6 |          0.6
(1 row)
As shown above, they all have equal scores.
And when you want to use it in a query:
SELECT col, word_similarity('some query', col) from my_table where col <% 'some query';
According to the document:
text <% text → boolean
Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by the pg_trgm.word_similarity_threshold parameter.
For something more complicated, like calculating hit scores, relevance weights/boosts, and faster response times on larger datasets, you should use Elastic instead, but keep in mind that an Elastic instance needs at least 2 GB of RAM, so you need dedicated EC2 instance(s) for that purpose. But for a small-to-medium app, pg_trgm works just fine while saving on server cost.
Hope you find this helpful.

Postgresql Query - Return all matching search terms for each result row when using an ANY query and LIKE

Essentially what I'm trying to figure out is if there is a way to return all matching search terms in addition to the matched row when running a query that looks up a list of items using ANY or IN. In most cases the search term will exactly match the returned column value but in cases such as text search or with certain extensions like IP4r this is not always the case. In addition, you can have multiple search terms match on a single row.
To make this concrete suppose this is my query:
SELECT id, item_name, description FROM items WHERE description LIKE ANY('{%gaming%, %computer%, %socks%, %men%}');
and it returns the following two rows:
id, item_name, description
1, 'computer', 'super fast gaming computer that will help you win'
5, 'socks', 'These socks are sure to please the men in your family'
What I'd like to know is which original search terms map to the result row that was returned. In other words, I'd like the returned rows to look like this:
id, search_terms, item_name, description
1, '{%gaming%, %computer%}', 'computer', 'super fast gaming computer that will help you win'
5, '{%socks%, %men%}', 'socks', 'These socks are sure to please the men in your family'
Is there a way to efficiently do this in PostgreSQL? In the example above we're using LIKE with strings but in my real-world scenario I'm using the IP4r extension to do IP lookups against CIDR ranges where you can have multiple IP addresses in the same returned CIDR range.
I previously asked this question: PostgreSQL 9.5: Return matching search terms in each result row when using LIKE which used a CASE statement to almost solve the problem I'm describing here.
The added complexity in the scenario above is that you can have multiple search terms match a single row (e.g., gaming and computer are both matches for the description 'super fast gaming computer that will help you win'). If you use a CASE statement, then only the first match in the CASE statement gets set as the search term and you miss any other matching search terms.
Thank you for your help!
This would be one way, using VALUES:
SELECT i.id, i.item_name, i.description, m.pat
FROM items AS i
JOIN (VALUES ('%gaming%'), ('%computer%'), ('%socks%'), ('%men%')) AS m(pat)
ON i.description LIKE m.pat;
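If you want the result shaped like the desired output above (one row per item with all matching patterns collected), a hedged variant aggregates the matches with array_agg():
SELECT i.id, array_agg(m.pat) AS search_terms, i.item_name, i.description
FROM items AS i
JOIN (VALUES ('%gaming%'), ('%computer%'), ('%socks%'), ('%men%')) AS m(pat)
  ON i.description LIKE m.pat
GROUP BY i.id, i.item_name, i.description;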

Sphinx Multi-Level Sort with Randomize

Here is my challenge with Sphinx Sort where I have Vendors who pay for premium placement and those who don't:
I already do a multi-level order including the PaidVendorStatus which is either 0 or 1 as:
order by PaidVendorStatus,Weight()
So in essence I end up with multiple sort groups:
PaidVendorStatus=1, Weight1
....
PaidVendorStatus=1, WeightN
PaidVendorStatus=0, Weight1
...
PaidVendorStatus=0, WeightN
The problem is I have three goals:
Randomly prioritize each vendor in any given sort group
Have each vendor's 'odds' of being randomly assigned top position be equal regardless of how many records they have returned in the group (so if Vendor A has 50 results and VendorB has 2 results they still both have 50% odds of being randomly assigned any given spot)
Ideally, maintain the same results order in any given search (so that if the user searches again the same order will be displayed)
I've tried various solutions:
Select CRC32(Vendor) as RANDOM...Order by PaidVendorStatus,Weight(),RANDOM
which solves 2 and 3, except that due to the nature of CRC32 it ALWAYS puts the same vendor first (and second, third, etc.), so in essence it does not solve the issue at all.
I tried making a Sphinx sql_attr_string in my Sphinx configuration which was a concatenation of Vendor and the record Title (Select... concat(Vendor,Title) as RANDOMIZER...) and then used that to randomize:
Select CRC32(RANDOMIZER) as RANDOM...
which solves 1 and 3, as now the Title field gets thrown into the randomization mix so that the same Vendor does not always get first billing. However, it fails at 2, since in essence I am only sorting by Title, and thus Vendor B with two results now has a very low chance of being sorted first.
In an ideal world, naturally, I could just order this way:
Order by PaidVendorStatus,Weight(),RAND(Vendor)
but that is not possible.
Any thoughts on this are appreciated. By the way, as per Barry Hunter's suggestion, I did check out this thread on UDFs, but unless I am misunderstanding it entirely (possible), it does not seem to be the solution for this problem.
Well one idea is:
SELECT * FROM (
SELECT *,uniqueserial(vendor_id) AS sorter FROM index WHERE MATCH(...)
ORDER BY PaidVendorStatus DESC, WEIGHT() DESC LIMIT 1000
) ORDER BY sorter DESC, WEIGHT() DESC;
This exploits Sphinx's 'multiple sort' functionality with a pseudo subquery.
This works because the inner query is sorted by PaidVendorStatus first, so those items come first, which affects the order that uniqueserial() is called in.
It's NOT really 'randomising' the results as such; it seems you are just randomising them to mix up the vendors (so a single vendor doesn't dominate the results). uniqueserial() works by 'spreading' a particular vendor's results out - the results will tend to cycle through the vendors.
This is tricky, as it exploits a relatively undocumented Sphinx feature - subqueries.
For the UDF see http://svn.geograph.org.uk/svn/modules/trunk/sphinx/
I still don't have an answer for your biased random (as in 2.), but I just remembered another feature that can help with 3.: you can supply a specific seed to the random number generator. Typically random generators are seeded from the current time, which gives ever-changing values, but a specific seed makes the ordering repeatable.
The seed is, however, a number, so you need a predictable but changing number. You could CRC the query.
... Sphinx doesn't support expressions in OPTION, so you would have to calculate the hash in the app.
<?php
$query = $db->Quote($_GET['q']);
$crc = crc32($query);
$sql = "SELECT id,IDIV(WEIGHT(),100) as i,RAND() as r FROM index WHERE MATCH($query)
ORDER BY PaidVendorStatus DESC,i DESC,r ASC OPTION random_seed=$crc";
If you wanted the results to only slowly evolve, add the current date, so each day gives a new selection...
$crc = crc32($query.date('Ymd'));

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigam in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
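You can see the trigram breakdown pg_trgm uses with its show_trgm() function, for example:
SELECT show_trgm('cafe');
which returns something like {"  c"," ca",afe,caf,"fe "} (whole strings get padded trigrams at word boundaries; a '%cafe%' pattern only needs the interior ones like caf and afe).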
The catch is, trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, runs regexp_matches against the entire CJK Unicode ranges, and returns an array of any such characters, or NULL if there are none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
  WHEN cjk_chars('query') IS NOT NULL THEN
       cjk_chars(location_name) @> cjk_chars('query')
       AND location_name LIKE '%query%'
  ELSE
       <tsvector/trigrams>
END
Ta-da, unigrams!
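Here is a rough, untested sketch of that cjk_chars() helper and index (the CJK ranges are abbreviated and would need extending; it assumes a UTF-8 database so the \u escapes in the regex work):
CREATE FUNCTION cjk_chars(txt text) RETURNS text[] LANGUAGE sql IMMUTABLE AS $$
  SELECT NULLIF(
    ARRAY(SELECT m[1]
          FROM regexp_matches(txt, '[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]', 'g') AS m),
    '{}'::text[]);
$$;
CREATE INDEX location_cjk_chars_idx ON location USING gin (cjk_chars(location_name));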
For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
e.g. given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx
ON location
USING gin (to_tsvector(location_name_language::regconfig, location_name));
and for search:
SELECT to_tsvector(location_name_language::regconfig, location_name) @@ to_tsquery('english', 'cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in both the simple configuration and the stored language of the text. I'd possibly also use a trigram-based approach like willglynn suggests, then unify the results for display, looking for common terms.
It's possible you may find Pg's fulltext search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what @willglynn already posted, I would consider the pg_trgm module, but preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx
ON tbl
USING gist (location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class ignores case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead) - because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges; for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
* Very fast.
* Retains the capability to find partial matches.
* Adds the capability for fuzzy search.
* Check out the % operator and set_limit(); a small sketch follows below.
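A tiny hedged example of the fuzzy side (the threshold value is arbitrary):
SELECT set_limit(0.3);                     -- similarity threshold used by the % operator
SELECT location_name, similarity(location_name, 'cafe') AS sml
FROM tbl
WHERE location_name % 'cafe'               -- fuzzy match, supported by the trigram index
ORDER BY sml DESC;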
The GiST index is also very fast for queries with LIMIT n to select the n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx
ON tbl
USING gist (location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful if you only want results of a specific language per query, and it would be very fast in that case.