I need to process strings like "hello world @mention a #hashtag" and index them for searching using PostgreSQL. I need to treat @mention and #hashtag specially.
The following produces a tsvector:
select to_tsvector('hello world @mention a #hashtag')
But the output looks like this:
"'a':4 'hashtag':5 'hello':1 'mention':3 'world':2"
What I would like is to see "@" preserved in front of 'mention' and "#" in front of 'hashtag'. Is there a way to do this using PostgreSQL?
I'm not sure tsearch is the right solution for your use case. Tsearch is good at full-text search, but it sounds like you want relational data. Can you parse the data in your application and create tag/user relationships from #hashtags and @mentions?
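As a rough illustration of the application-side parsing suggested above, here is a minimal Python sketch. It assumes mentions are written with a leading "@" and hashtags with a leading "#", each followed by letters, digits, or underscores; the function name extract_tags is made up for this example, and the resulting lists would then be stored in ordinary relational tables of your own design.

```python
import re

# Assumption: a tag is "@" or "#" followed by word characters, and it must not
# be glued to a preceding word character (so "a@b.com" is not a mention).
TOKEN_RE = re.compile(r"(?<!\w)([@#])(\w+)")


def extract_tags(text):
    """Return (mentions, hashtags) found in text, lower-cased, in order."""
    mentions, hashtags = [], []
    for marker, name in TOKEN_RE.findall(text):
        target = mentions if marker == "@" else hashtags
        target.append(name.lower())
    return mentions, hashtags


mentions, hashtags = extract_tags("hello world @mention a #hashtag")
print(mentions, hashtags)
```

The remaining plain text can still go through to_tsvector for ordinary full-text search, while the extracted names go into their own tag and user tables.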
Recently, I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it. This table has over 200 million rows and querying using to_tsquery worked pretty well for the column of type tsvector.
Now I need to run the following kinds of queries, but reading the documentation I couldn't find how to do it (or it's there and I didn't understand it, because full-text search is still new to me):
Starts with
Ends with
How can I make the query below return true only if the query is "The cat" (starts with) and "the book" (ends with), if that is possible in full-text search?
select to_tsvector('The cat is on the book') @@ to_tsquery('Cat')
I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it.
How did you do that? FTS doesn't apply to LIKE queries. It applies to FTS queries, such as those using @@.
You can't directly look for strings starting and ending with certain words. You can use the index to filter on cat and book, then refilter those rows for ones having them in the right place.
select * from whatever where tsv_col @@ to_tsquery('cat & book') and text_col LIKE 'The cat % the book';
If you also want to match something like 'The cathe book', where the prefix and suffix overlap, you would have to do something else, using two separate LIKE conditions.
I'm using PostgreSQL 13 and my problem was easily solved with the @> operator, like this:
select id from documents where keywords @> '{"winter", "report", "2020"}';
meaning that keywords array should contain all these elements. Also I've created a GIN index on this column.
Is it possible to achieve similar behavior even if I provide my request like '{"re", "202", "w"}' ? I heard that ngrams have semantics like this, but "intersection" capabilities of arrays are crucial for me.
In your example, the matches are all prefixes. Is that the general rule here? If so, you would probably want to use the prefix-match feature of full text search, not trigrams. It would require you to reformat your data, or at least your query.
select * from
(values (to_tsvector('simple','winter report 2020'))) f(x)
where x @@ 're:* & 202:* & w:*'::tsquery;
If the strings can contain punctuation which you want preserved, you would need to take pains to properly format them into a quoted tsvector yourself rather than just letting to_tsvector deal with it. Using 'simple' config gets rid of the stemming and stop word removal features, which would interfere with what you want to do.
I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) @@ to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user entry according to its spaces.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND statements in the WHERE clause with LIKE '% ... %' but that would probably hurt performance and doesn't support fuzzysearch.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine, however in some cases it doesn't and I believe that's because it interprets it like English and ignores some words like of, off, in, etc...
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
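For the option of writing the tsquery text yourself, here is a minimal Python sketch of the string building (the question does this in Ruby; Python is used here only for illustration). The function name build_prefix_tsquery is hypothetical. It assumes every word should be a required prefix term, lower-cases the input as noted above, and keeps only letters and digits so that tsquery operators typed by the user cannot leak into the query.

```python
import re


def build_prefix_tsquery(user_input):
    """Turn free-form user input into AND-ed prefix terms for a tsquery."""
    # [^\W_]+ matches runs of letters and digits only, which drops tsquery
    # syntax such as & | ! ( ) : * and quotes from the input.
    words = re.findall(r"[^\W_]+", user_input.lower())
    return " & ".join(f"{word}:*" for word in words)


print(build_prefix_tsquery("OFF bmt"))
```

The result would be passed as a bound parameter and cast to tsquery on the database side, and compared against a tsvector built with the 'simple' configuration so that no stemming or stop-word removal gets in the way. An empty input yields an empty string, which the caller should handle before querying.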
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If you did use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND statements in the WHERE clause
with LIKE '% ... %' but that would probably hurt performance and
doesn't support fuzzysearch.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is a lot fuzzier than what you are currently doing, so I don't understand how you would lose that. pg_trgm also provides the '<->' operator, which is even fuzzier and might be your best bet. It can deal with typos fairly well when they are embedded in long strings, but in short strings they can really be a problem.
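To see why typos hurt more in short strings, here is a small Python approximation of trigram similarity. It is only a sketch of the idea, not pg_trgm's actual implementation: it assumes words are lower-cased, split on non-alphanumeric characters, padded with two leading spaces and one trailing space, and compared as sets of three-character chunks.

```python
import re


def trigrams(text):
    """Approximate trigram extraction: lower-case, split, pad, take 3-grams."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(similarity("ofice", "office"))  # one missing letter in a longer word
print(similarity("cat", "cta"))       # one transposition in a short word
```

In this approximation "ofice" and "office" share 5 of 8 distinct trigrams, while "cat" and "cta" share only 1 of 7, so a single slip in a three-letter word wipes out almost all of the overlap.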
In your case, to_tsquery() needs to indicate that all words are required. You can use to_tsquery('english', 'off & bmt') and specify a dictionary that contains the word 'off' (see link 4 below).
Some tips for using tsvector:
1. Create a column on your table that combines all the fields with terms you want to search. This column should be of type tsvector.
2. Your search should use tsquery, as you mentioned. In the search you can apply some useful tricks, such as the following:
2.a. Create a rank with ts_rank() to set the search priority. The rank indicates how closely the tsquery matches the original terms.
2.b. If you have domain-specific words (in my case, chemical terms), you can create a dictionary of the commonly used words. It can be used to extract roots or word parts for comparing similarity.
2.c. About performance: tsquery works very well with GIN and GiST indexes. I have used full-text search on a table with over 200k rows, and the search returns in under 0.4 seconds.
If you need fuzzier matching on words, you can also use fuzzy string matching. Alongside tsquery I used levenshtein_less_equal with a distance of 3. The function finds words that differ from the search term by 3 or fewer letters; for single words it is a good way to search.
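For readers unfamiliar with the distance being described, here is a plain Python sketch of Levenshtein edit distance with a threshold check. It illustrates the concept only and is not the fuzzystrmatch implementation; the helper name within_distance and its default of 3 are assumptions taken from the distance mentioned above.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def within_distance(a, b, max_distance=3):
    """True when the two words differ by at most max_distance edits."""
    return levenshtein(a, b) <= max_distance


print(levenshtein("ofice", "office"))
print(within_distance("cat", "dog"))
```

A threshold of 3 is generous for short words: "cat" and "dog" are exactly three substitutions apart, so they would count as a match. Scaling the threshold with word length may work better for short search terms.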
1. tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
2. text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
3. Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
4. Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
I am using SQLite FTS extension in my iOS application.
It performs well, but the problem is that it matches only string prefixes (that is, "starts with" keyword searches).
i.e.
This works:
SELECT * FROM tablename WHERE columnname MATCH 'searchterm*'
but the following two don't:
SELECT * FROM tablename WHERE columnname MATCH '*searchterm'
SELECT * FROM tablename WHERE columnname MATCH '*searchterm*'
Is there any workaround for this or any way to use FTS to build a query similar to LIKE '%searchterm%' query.
EDIT:
As pointed out by Retterdesdialogs, storing the entire text in reverse order and running a prefix search on a reverse string is a possible solution for ends with/suffix search problem, which was my original question, but it won't work for 'contains' search. I have updated the question accordingly.
In my iOS and Android applications, I have shied away from FTS search for exactly the reason that it doesn't support substring matches due to lack of suffix queries.
The workarounds seem complicated.
I have resorted to using LIKE queries, which while being less performant than MATCH, served my needs.
The workaround is to store the reversed string in an extra column. See this link (it's not exactly the same, but it should give you an idea):
Search Suffix using Full Text Search
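A minimal sketch of the reversed-column idea, using Python's built-in sqlite3 module so it can be tried outside an iOS project. It assumes the SQLite build includes FTS5 (the question may well be using FTS3 or FTS4, where the same idea applies), that the suffix is a single alphanumeric word, and the table and column names (notes, body, body_rev) are made up for the example.

```python
import sqlite3


def reverse_words(text):
    """Reverse the characters of each word while keeping word order."""
    return " ".join(word[::-1] for word in text.split())


conn = sqlite3.connect(":memory:")
# body is stored but not indexed; body_rev holds the reversed words.
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body UNINDEXED, body_rev)")
for body in ["full text searchterm", "presearchterm only", "searchterms plural"]:
    conn.execute(
        "INSERT INTO notes (body, body_rev) VALUES (?, ?)",
        (body, reverse_words(body)),
    )


def suffix_search(conn, suffix):
    """Find rows containing a word that ends with the given suffix."""
    # A suffix match on the original word is a prefix match on the reversed one.
    query = reverse_words(suffix) + "*"
    rows = conn.execute(
        "SELECT body FROM notes WHERE notes MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in rows]


print(suffix_search(conn, "term"))
```

The row containing "searchterms" is not returned, because that word does not end with "term". This confirms the approach covers suffix matches only, not general "contains" matches.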
To get it to work for contains queries, you need to store all suffixes of the terms you want to be able to search. This has the downside of making the database really large, but that can be avoided by compressing the data.
SQLite FTS contains and suffix matches
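The all-suffixes idea can be sketched in a few lines of Python. This only illustrates the principle (the helper names are hypothetical): every suffix of a term is what would be stored in the indexed column, so that a "contains" search becomes an ordinary prefix search against one of those suffixes.

```python
def suffixes(word, min_length=1):
    """Return every suffix of word that has at least min_length characters."""
    return [word[i:] for i in range(len(word) - min_length + 1)]


def contains_via_suffixes(word, fragment):
    """A substring match is a prefix match against one of the word's suffixes."""
    return any(suffix.startswith(fragment) for suffix in suffixes(word))


print(suffixes("search"))
print(contains_via_suffixes("searchterm", "archt"))
```

A word of n characters contributes n indexed terms, which is where the size growth mentioned above comes from. Raising min_length trims the shortest suffixes at the cost of not matching very short fragments near the end of a word.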
I'm developing an app in Rails on OS X using PostgreSQL 8.4. I need to set up the database for the app so that standard text queries are case-insensitive. For example:
SELECT * FROM documents WHERE title = 'incredible document'
should return the same result as:
SELECT * FROM documents WHERE title = 'Incredible Document'
Just to be clear, I don't want to use:
(1) LIKE in the where clause or any other type of special comparison operators
(2) citext for the column datatype or any other special column index
(3) any type of full-text software like Sphinx
What I do want is to set the database locale to support case-insensitive text comparison. I'm on Mac OS X (10.5 Leopard) and have already tried setting the Encoding to "LATIN1", with the Collation and Ctype both set to "en_US.ISO8859-1". No success so far.
Any help or suggestions are greatly appreciated.
Thanks!
Update
I have marked one of the answers given as the correct answer out of respect for the folks who responded. However, I've chosen to solve this issue differently than suggested. After further review of the application, there are only a few instances where I need case-insensitive comparison against a database field, so I'll be creating shadow database fields for the ones I need to compare case-insensitively. For example, name and name_lower. I believe I came across this solution on the web somewhere. Hopefully PostgreSQL will allow similar collation options to what SQL Server provides in the future (i.e. DOCI).
Special thanks to all who responded.
You will likely need to use a function on the column to normalize your text, for example by converting both sides to uppercase:
SELECT * FROM documents WHERE upper(title) = upper('incredible document')
Note that this may hurt performance, because a plain index on the column can no longer be used for the comparison. If that becomes a problem, you can define an index on the same expression for the target columns, e.g.
CREATE INDEX I1 on documents (upper(title))
With all the limitations you have set, possibly the only way to make it work is to define your own = operator for text. It is very likely that it will create other problems, such as creating broken indexes. Other than that, your best bet seems to be to use the citext datatype; that would still let the ORM stuff you're using generate the SQL.
(I am not mentioning the possibility of creating your own locale definition because I haven't ever heard of anyone doing it.)
Your problem and your exclusions are like saying "I want to swim, but I don't want to have to move my arms."
You will drown trying.
I don't think that is what locale or encoding is used for. Encoding is more for picking a character set, not for determining how to deal with characters. If there were a setting it would be in the config, but I haven't seen one.
If you do not want to use ilike for fear of not being able to port to another database then I would suggest you look into what ORM options might be available with ActiveRecord if you are using that.
Here is something from one of the top postgres guys: http://archives.postgresql.org/pgsql-php/2003-05/msg00045.php
edit: fixed specific references to locale.
SELECT * FROM documents WHERE title ~* 'incredible document'