Postgresql full text search part of words - postgresql

Is postresql capable of doing a full text search, based on 'half' a word?
For example I'm trying to seach for "tree", but I tell postgres to search for "tr".
I can't find such a solution that is capable of doing this.
Currently I'm using
select * from test, to_tsquery('tree') as q where vectors ## q ;
But I'd like to do something like this:
select * from test, to_tsquery('tr%') as q where vectors ## q ;

You can use tsearch prefix matching, see http://www.postgresql.org/docs/9.0/interactive/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
postgres=# select to_tsvector('tree') ## to_tsquery('tr:*');
?column?
----------
t
(1 row)
It will only work for prefix search though, not if you want partial match at any position in the word.

Sounds like you simply want wildcard matching.
One option, as previously mentioned is trigrams. My (very) limited experience with it was that it was too slow on massive tables for my liking (some cases slower than a LIKE). As I said, my experience with trigrams is limited, so I might have just been using it wrong.
A second option you could use is the wildspeed module: http://www.sai.msu.su/~megera/wiki/wildspeed
(you'll have to build & install this tho).
The 2nd option will work for suffix/middle matching as well. Which may or may not be more than you're looking for.
There are a couple of caveats (like size of the index), so read through that page thoroughly.

select * from test, to_tsquery('tree') as q
where vectors ## q OR xxxx LIKE ('%tree%')
':*' is to specify prefix matching.

It can be done with trigrams but it's not part of tsearch2.
You can view the manual here: http://www.postgresql.org/docs/9.0/interactive/pgtrgm.html
Basically what the pg_tgrm module does is split a word in all it's parts so it can search for those separate parts.

Related

What is the syntax for AND OR NOT in Postgres trigram search?

I have implemented Postgres 9.6 trigram search https://www.postgresql.org/docs/9.6/static/pgtrgm.html into my application which works fine for a single search term.
I can't see how to allow my users to do AND OR NOT searches though.
Currently, if I put "perl" into the search field, it will return hundreds of results. That's great and works fine.
Now if I want to search for documents containing "perl" and also containing "javascript", no matter what search term I put in, no results come back.
I have tried for example:
"perl javascript"
"perl AND javascript"
"perl && javascript"
So I am trying to work out how I can provide to my end users a more sophisticated search than single term only. I would like my application users to be able to do full text searches with and/or/not.
Is it possible? If yes, what is the syntax?
This query finds ...
documents containing "perl" and also containing "javascript"
SELECT *
FROM tbl
WHERE document ~ 'perl'
AND document ~ 'javascript';
Note that "perlane" or "javascripting" or "Kaperl" also qualify. To search for whole words, you might be interested in text search instead. Overview:
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

UIMA Ruta. Retrieve phrases separated by WS (spaces, breaks, etc.)

I'm going to retrieve phrases separated by spaces, breaks and other punctuation symbols.
I've spent a lot of time trying to find out the best way to do that.
Option 1. The easiest way.
DECLARE T1, T2;
"cool rules" -> T1;
"cool rule" -> T2;
Input: "123cool rules".
Result: T1 and T2 are triggered;
Option 2. Using WORDLIST and WORDTABLE.
Let wordlist 1.txt contains 2 rows:
cool rules
cool
code for extraction is the following
WORDLIST WList = '1.txt';
DECLARE W1;
Document{-> MARKFAST(W1, WList, true, 2)};
Input: "cool rules".
Result: only first row is extracted. I guess that in this case intersected rules are not triggered.
Option 3. Mark combination of two tokens
DECLARE T1;
("cool" "rule") {-> T1};
Input: "cool rules cool rule 1cool rule"
Result: 2 annotations: cool rule + 1cool rule. Loss of extraction speed in 10 times.
Option 4. REGEXP matching
Maybe it is possible to match such pattern "cool\\srule", but I have no idea how to define the type expression. SW*{REGEXP("cool\\srule")->T1} does not provide results.
As you see, I'm trying to solve a very simple task, but did not succeed yet. The option 3 is a really good way to do that, but extraction process becomes slower in 10 times.
If you want to identify specific phrases, you should use a dictionary lookup, not directly rules.
Therefore, I'd recommend the MARKFAST option 2. However, there are two problems: (a) only longest matches are supported and (b) you either need to change the segmentation (tokenization) or do some postprocessing.
(a) This cannot be solved. If this is really required, a different dictionary annotator should be used. See e.g., the UIMA mailing lists.
(b) The MARKFAST works on RutaBasic annotations which are automatically created for each smallest part. Because of the default seeder, the token "1cool" consists of two RutaBasics, one for the NUM, one for the SW. If you do not want to change the preprocessing, you can simply apply a rule that fixed that like
RETAINTYPE(WS);
ANY{-PARTOF(WS)} t:#T1{-> UNMARK(t)};
btw, option 4 won't work because the REGEXP condition checks on the covered text of the matched annotation SW which only represents one token. If you do something like (SW+){REGEXP("cool\\srule")->T1}, then the rule wont match if there is another SW afterwards.
DISCLAIMER: I am a developer Of UIMA Ruta

How to get wordforms in sphinx?

How i can get all morphology forms of the word?
For example, searching keyword is:
runner
Result should be:
run,running ... etc
You usually don't need one. Use a stemmer, which can do the reverse. ie it removes the "ending", so that matching works, rather than trying to figure out all the possible endings.
https://en.wikipedia.org/wiki/Stemming
ie use morphology, rather than worforms.
http://sphinxsearch.com/docs/current.html#conf-morphology

Sphinx with metaphone and wildcard search

we are an anatomy platform and use sphinx for our search. We want to make our search more fuzzier and started to use metaphone to correct spelling mistakes. It finds for example phalanges even though the search word is falanges.
That's good but we want more. We want that the user could type in falange or even falang and we still find phalanges. Any ideas how to accomplish this?
If you are interested you can checkout our sphinx config file here.
Thanks!
Well you can enable both metaphone and min_prefix_len on an index at once. It will sort of work.
falange*
might then just work. (to match phalanges)
The problem is the 'stripped' letters may change the 'sound' of the word (because change the pronunciation)
eg falange becomes FLNJ, but falang acully becomes FLNK - so they no longer 'substrings' of one another. (ie phalanges becomes FLNJS, which FLNK* wont match)
... to be honest I dont know a good solution. You could perhaps get better results, if was to apply stemming, BEFORE metaphone. (so the endings that change the pronouncation of the words are removed.
Alas Sphinx can't do this. If you enable stemming and metaphone together, only ONE of the processors will ever fire.
Two possible solutions, implement stemming outside of sphinx (or maybe with regexp_filter. Not sure if say a porter stemmer can be implemnented purely with regular expressions)
or modify sphinx, so that ALL morphology processors apply. (rather than just the first one that changes the word)

Sphinx search configuration for words ending with apostrophes

I am trying to improve my Sphinx configuration and I have a trouble with words ending with apostrophes.
For example, for Surfin' USA result, searching with "Surfin USA" returns match but "Surfing USA" doesn't return anything. How can I set Sphinx to return result for such situation?
Hmm, thats an interesting one. Not sure sphinx can automatically deal with this, because it has no way of knowing what the Apostrophe is meant to represent. I suppose there are cases where it could be multiple things.
The only way I can think would be to list them in exceptions, you can build a list of all words want to support
Surfin' > Surfing
Have to use exceptions to be able to use the apostrophe
You might want to add
Surfin > Surfing
too, so can search without the apostrophe too.