PostgreSQL: Full text search multitenant site, plus only parts of site - postgresql

I'm developing a multitenant web application, and I want to add full text search, so that people will be able to:
1) search only the site they are currently visiting (but not all sites), and
2) search only a section of that site (e.g. restrict search to a blog or a forum on the site), and
3) search a single forum thread only.
I wonder what indexes should I add?
Please assume the database is huge (so that e.g. index-scanning-by-site-ID and then filtering-by-full-text-search is too slow).
I can think of three approaches:
Create three indexes. 1) One that indexes everything on a per site basis.
And 2) one that indexes everything on a per-site plus site-section basis.
And 3) one that indexes everything on a per-site and page-id basis.
Create one single index, and insert into [the text to index] magic words like:
"site_<site-id>"
and "section_<section-id>" and "page_<page-id>", and then when I search
for section XX in site YYY I could prefix the search query like so:
"site_XX AND section_YYY AND ...".
Dynamically add database indexes when a new site or site section is created:
create index dw1_posts__search_site_YYY
on dw1_posts using gin(to_tsvector('english', approved_text))
where site_id = 'YYY';
Does any of these three approaches above make sense? Are there better alternatives?
(Details: However, perhaps approach 1 is impossible? Attempting to index-a-column and also index-for-full-text-searching at the same time, results in syntax errors:
> create index dw1_posts__search_site
on dw1_posts (site_id)
using gin(to_tsvector('english', approved_text));
ERROR: syntax error at or near "using"
LINE 1: ...dex dw1_posts__search_site on dw1_posts(site_id) using gin(...
^
> create index dw1_posts__search_site
on dw1_posts
using gin(to_tsvector('english', approved_text))
(site_id);
ERROR: syntax error at or near "("
LINE 1: ... using gin(to_tsvector('english', approved_text)) (site_id);
(If approach 1 was possible, then I could do queries like:
select ... from ... where site_id = ... and <full-text-search-column> ## <query>;
and have PostgreSQL first check site_id and then the full-text-search column, using one single index.)
)
/ End details.)
Update, one week later: I'm using ElasticSearch instead. I got the impression that no scalable solution exists, for faceted search, with relational databases / PostgreSQL. And integrating with ElasticSearch seems to be roughly as simple as implementing and testing and tweaking the approaches suggested here. (For example, PostgreSQL's stemmer/whatever-it's-called might split "section_NNN" into two words: "section" and "NNN" and thus index words that doesn't exist on the page! Tricky to fix such small annoying issues.)

The normal approach would be to create:
one full text index:
CREATE INDEX idx1
ON dw1_posts USING gin(to_tsvector('english', approved_text));
a simple index on the site_id:
CREATE INDEX idx2
on dw1_posts(page_id);
another simple index on the page_id:
CREATE INDEX idx3
on dw1_posts(site_id);
Then it's the SQL planner's business to decide which ones to use if any, and in what order depending on the queries and the distribution of values in the columns. There is no point in trying to outsmart the planner before you've actually witnessed slow queries.

Another alternative, which is similar to the "site_<site-id>" and "section_<section-id>" and "page_<page-id>" alternative, should be to prefix the text to index with:
SiteSectionPage_<site-id>_<section-id>_<subsection-id>_<page-id>
And then use prefix matching (i.e. :*) when searching:
select ... from .. where .. ## 'SiteSectionPage_NN_MMM:* AND (the search phrase)'
where NN is the site ID and MMM is the section ID.
But this won't work with Chinese? I think trigrams are appropriate when indexing Chinese, but then SiteSectionPage... will be split into: Sit, ite, teS, eSe, which makes no sense.

Related

PostgreSQL: to_tsquery starts with and ends with search

Recently, I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it. This table has over 200 million rows and querying using to_tsquery worked pretty well for the column of type tsvector.
Now I need to hit the following queries but reading the documentation I couldn't find how to do it (or it's there and I didn't understand because full-text search is something new to me yet)
Starts with
Ends with
How can I make the query below return true only if the query is "The cat" (starts with) and "the book" (ends with), if it's possible in full-text search.
select to_tsvector('The cat is on the book') ## to_tsquery('Cat')
I implemented a PostgreSQL 11 full-text search on a huge table I have in a system to solve the problem of hitting LIKE queries in it.
How did you do that? FTS doesn't apply for LIKE queries. It applies for FTS queries, such as ##.
You can't directly look for strings starting and ending with certain words. You can use the index to filter on cat and book, then refilter those rows for ones having them in the right place.
select * from whatever where tsv_col ## to_tsquery('cat & book') and text_col LIKE 'The cat % the book';
Unless you want to match something like 'The cathe book' then you would have to do something else, with two different LIKE.

Sphinx / Manticore - base one plain index off another?

I have a plain text index that sucks data from MySQL and inserts it into Manticore in a format I need (e.g. converting datetime strings to timestamp, CONCATing some fields etc.
I then want to create a second plain text index based off this data to group it further. This will save me having to either re-run the normalisation that's done to the first index on INSERT or make it easier for me to query in the future.
For example, my first index is a list of all phone calls that have been made / received (telephone number, duration, agent). The second index should group by Year-Month-Date in such a way that I can see how many calls each agent made on that day. This means I end up with idx_phone_calls and idx_phone_calls_by_date.
Currently, I generate the first index from MySQL, then get Manticore to query itself (by setting the MySQL host to localhost. It works, but it feels as though I should be able to query Manticore directly from within the index. However, I'm struggling to find if that's possible.
Is there a better way to do it?
Well Sphinx/Manticore, has its own GROUP BY function. So maybe can just run the final query against the original index anyway, avoid the need for the second index.
Sphinx's Aggregation (in some way) is more powerful than MySQL, and can do some 'super aggregation' functions (like with WITHIN GROUP ORDER BY)
But otherwise there is no direct way to create an off another (eg there is no CREATE TABLE idx_phone_calls_by_date SELECT ... FROM idx_phone_calls ... )
Your 'solution' of directing indexer to query the data from searchd is good. In general this should be pretty efficent, particully on localhost, there is little overhead. Maintains the logical seperation of searchd being for queries, indexer being for well building indexes.

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table name table and a string column named column, I want to search for the word word in that column in the following way: exact matches be on top, followed by prefix matches and finally postfix matches.
Currently I got the following solutions:
Solution 1:
select column
from (select column,
case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end as rank
from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
or column like 'word%'
or column like '%word'
order by case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end;
Now my question is which one of the two solutions are more efficient or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
For the Where, is not needed as it is covered by ; it might confuse the DB to do 2 checks instead of one.
But the biggest problem is the third one as this has no way to be optimized by an index.
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend you to explore fuzzy string matching and full text search; looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you definitely should add the extension "pgtrgm", as it will enable you to add a GIN index on the column that will speedup LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
And seriously, have a look to FTS. It does provide ranking. If your requirements are strict to what you described, you can still perform the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
And even I wrote a post recently when I added FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

Create index on first 3 characters (area code) of phone field?

I have a Postgres table with a phone field stored as varchar(10), but we search on the area code frequently, e.g.:
select * from bus_t where bus_phone like '555%'
I wanted to create an index to facilitate with these searches, but I got an error when trying:
CREATE INDEX bus_ph_3 ON bus_t USING btree (bus_phone::varchar(3));
ERROR: 42601: syntax error at or near "::"
My first question is, how do I accomplish this, but also I am wondering if it makes sense to index on the first X characters of a field or if indexing on the entire field is just as effective.
Actually, a plain B-tree index is normally useless for pattern matching with LIKE (~~) or regexp (~), even with left-anchored patterns, if your installation runs on any other locale than "C", which is the typical case. Here is an overview over pattern matching and indices in a related answer on dba.SE
Create an index with the varchar_pattern_ops operator class (matching your varchar column) and be sure to read the chapter on operator classes in the manual.
CREATE INDEX bus_ph_pattern_ops_idx ON bus_t (bus_phone varchar_pattern_ops);
Your original query can use this index:
... WHERE bus_phone LIKE '555%'
Performance of a functional index on the first 3 characters as described in the answer by #a_horse is pretty much the same in this case.
-> SQLfiddle demo.
Generally a functional index on relevant leading characters would be be a good idea, but your column has only 10 characters. Consider that the overhead per tuple is already 28 bytes. Saving 7 bytes is just not substantial enough to make a big difference. Add the cost for the function call and the fact that xxx_pattern_ops are generally a bit faster.
In Postgres 9.2 or later the index on the full column can also serve as covering index in index-only scans.
However, the more characters in the columns, the bigger the benefit from a functional index.
You may even have to resort to a prefix index (or some other kind of hash) if the strings get too long. There is a maximum length for indices.
If you decide to go with the functional index, consider using the xxx_pattern_ops variant for a small additional performance benefit. Be sure to read about the pros and cons in the manual and in Peter Eisentraut's blog entry:
CREATE INDEX bus_ph_3 ON bus_t (left(bus_phone, 3) varchar_pattern_ops);
Explain error message
You'd have to use the standard SQL cast syntax for functional indices. This would work - pretty much like the one with left(), but like #a_horse I'd prefer left().
CREATE INDEX bus_ph_3 ON bus_t USING btree (cast(bus_phone AS varchar(3));
When using like '555%' an index on the complete column will be used just as well. There is no need to only index the first three characters.
If you do want to index only the first 3 characters (e.g. to save space), then you could use the left() funcion:
CREATE INDEX bus_ph_3 ON bus_t USING btree (left(bus_phone,3));
But in order for that index to be used, you would need to use that expression in your where clause:
where left(bus_phone,3) = '555';
But again: that is most probably overkill and the index on the complete column will be good enough and can be used for other queries as well e.g. bus_phone = '555-1234' which the index on just the first three characters would not.

Postgres - gin index doesn't work

it seems that my server won't use gin index.
I've created a new database with one table.
I've inserted one row as example.
I've loaded trigram extension and created gin index using trigrams
But when I check if the index works right I can see it doesn't
Any ideas?
SQL: http://pastebin.com/1yDQQA1Z
P.S. A day ago I've followed a tutorial about trigrams. Basically it was the same like my example above. The table had 2 columns, numeric(5, 0) and character varying (the one with gin trgm index). Query was with like operator using "%" and index was working (I could see Bitmap using in query explain), so I know, my server can use index (and its properly installed).
Thanks in advance.
Don't test on one row, it is meaningless.
Here's an excerpt of the documentation explaining why, in Examining Index Usage:
Use real data for experimentation. Using test data for setting up
indexes will tell you what indexes you need for the test data, but
that is all.
It is especially fatal to use very small test data sets. While
selecting 1000 out of 100000 rows could be a candidate for an index,
selecting 1 out of 100 rows will hardly be, because the 100 rows
probably fit within a single disk page, and there is no plan that can
beat sequentially fetching 1 disk page.