Full-text with partitioning in PostgreSQL - postgresql

I have a table that I want to search in.
Table:
user_id: integer
text: text
deleted_at: datetime (nullable)
Index:
CREATE INDEX CONCURRENTLY "index_notifications_full_text" ON "notifications"
USING "gist" (to_tsvector('simple'::REGCONFIG, COALESCE(("text")::TEXT, ''::TEXT))) WHERE "deleted_at" IS NULL;
I need to implement a full-text search for users (only inside their messages that are not deleted).
How can I implement an index that indexes both user_id and text?

Using the btree_gin and/or btree_gist extensions, you can include user_id directly into a multicolumn FTS index. You can try it on each type in turn, as it can be hard to predict which one will be better in a given situation.
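For example, with btree_gin it could look something like this (a sketch only; the index name is illustrative, and the GiST variant with btree_gist is analogous). Note that the query has to repeat the indexed expression:
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX CONCURRENTLY index_notifications_user_full_text ON notifications
USING gin (user_id, to_tsvector('simple', COALESCE("text", ''))) WHERE deleted_at IS NULL;
-- a matching query
SELECT user_id, "text" FROM notifications
WHERE user_id = 42
  AND deleted_at IS NULL
  AND to_tsvector('simple', COALESCE("text", '')) @@ to_tsquery('simple', 'hello');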
Alternatively, you could partition the table by user_id using declarative partitioning, and then keep the single-column index (although in that case, GIN is likely better than GiST).
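A minimal sketch of the partitioning route (assuming hash partitioning on user_id; names and partition count are illustrative):
CREATE TABLE notifications_parted (
    user_id    integer NOT NULL,
    "text"     text,
    deleted_at timestamp
) PARTITION BY HASH (user_id);
CREATE TABLE notifications_parted_0 PARTITION OF notifications_parted
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- ... repeat for remainders 1 to 3 ...
-- an index created on the parent is created on every partition as well
CREATE INDEX ON notifications_parted
USING gin (to_tsvector('simple', COALESCE("text", ''))) WHERE deleted_at IS NULL;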
If you want more detailed advice, you need to give us more details, like how many user_ids there are, how many notifications there are per user_id, how many tokens there are per notification, and an example of a plausible query you hope to support efficiently.

You can add a new column of type tsvector, named e.g. document_with_idx, to your notifications table:
ALTER TABLE notifications ADD COLUMN document_with_idx tsvector;
Then populate that column with a tsvector built from the user_id and text columns:
update notifications set document_with_idx = to_tsvector(user_id || ' ' || coalesce(text, ''));
Finally, create an index with a name like document_idx on that column:
CREATE INDEX document_idx
ON notifications
USING GIN (document_with_idx);
Now you can do a full-text search on both the user_id and text column values using that document_with_idx column, like so:
select user_id, text from notifications
where document_with_idx @@ to_tsquery('your search string goes here');
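Because user_id was folded into the tsvector, a search can also be limited to a single user inside the tsquery itself, for example (42 being a placeholder user_id):
select user_id, text from notifications
where document_with_idx @@ to_tsquery('42 & hello');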
See more: https://www.postgresql.org/docs/9.5/textsearch-tables.html

Related

fuzzy finding through database - prisma

I am trying to build a storage manager where users can store their lab samples/data. Unfortunately, this means that the tables will end up being quite dynamic, as each sample might have different data associated with it. I will still require users to define a schema so I can display the data properly; however, I think this schema will have to be represented as a JSON field in the underlying database.
I was wondering, in Prisma, is there a way to fuzzy search through collections? Could I type something like help and then return all rows that match this expression ANYWHERE in their columns (including the JSON fields)? Could I do something like this at all with PostgreSQL? Or with MongoDB?
thank you
You can easily do that with jsonb in PostgreSQL.
If you have a table defined like
CREATE TABLE userdata (
id bigint PRIMARY KEY,
important_col1 text,
important_col2 integer,
other_cols jsonb
);
You can create an index like this
CREATE INDEX ON userdata USING gin (other_cols);
and search efficiently with
SELECT id FROM userdata WHERE other_cols @> '{"attribute": "value"}';
Here, @> is the jsonb containment operator in PostgreSQL.
Yes, in PostgreSQL you surely can do this. It's quite straightforward. Here is an example.
Let your table be called the_table, aliased as tht. Cast an entire table row to text with tht::text and use the case-insensitive regular expression match operator ~* to find rows that contain help anywhere in that text. You can use more elaborate and powerful regular expressions for searching too.
Please note that the ~* match on the whole-row text cast cannot use a plain index, so this query will result in a sequential scan.
select * -- or whatever list of expressions you need
from the_table as tht
where tht::text ~* 'help';
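If the sequential scan becomes a problem and the fuzzy search only needs to cover the jsonb column, a trigram index over its text form can help (a sketch, assuming the pg_trgm extension and the userdata table from above; the same idea appears in a later answer below):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON userdata USING gin ((other_cols::text) gin_trgm_ops);
-- this form of the search can use the trigram index
SELECT * FROM userdata WHERE other_cols::text ~* 'help';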

How to properly structure a Multicolumn Index with a partial field search

What is the best way to set up a multicolumn index using the full_name column and the state column? The search will use the exact state with a partial search on the full_name column. The query will look like this:
WHERE full_name ~* 'jones' AND state = 'CA';
Searching roughly 20 million records.
Thanks!
John
The state seems straightforward enough -- a normal index should suffice. The full-name search is a lot more work, but with 20 million records, I think the dividends will speak for themselves.
Create a new field in your table as a tsvector, and call it full_name_search for the sake of this example:
alter table <blah> add column full_name_search tsvector;
Do an initial population of the column:
update <blah>
set full_name_search = to_tsvector (full_name);
If possible, make the field non-nullable.
Create a trigger that will automatically populate this field whenever a row is inserted or updated:
create trigger <blah>_insert_update
before insert or update on <blah>
for each row execute procedure
tsvector_update_trigger(full_name_search,'pg_catalog.english',full_name);
Add an index on the new field:
create index <blah>_ix1 on <blah>
using gin(full_name_search);
From here, restructure the query to search on the tsvector field instead of the text field:
WHERE full_name_search @@ to_tsquery('jones') AND state = 'CA';
You can take shortcuts on some of these steps (for example, skip the extra field and use a function-based index instead); that will still get you improved performance, just not as much as the full approach.
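The function-based (expression index) shortcut would look roughly like this (a sketch; note the query must repeat the exact indexed expression, including the explicit 'english' configuration):
create index <blah>_fts_ix on <blah>
using gin (to_tsvector('english', full_name));
-- query form that can use this index:
WHERE to_tsvector('english', full_name) @@ to_tsquery('jones') AND state = 'CA';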
One caveat -- I think to_tsvector splits the contents into lexemes at word boundaries, so this:
Catherine Jones Is a Nice Lady
will work fine, but this:
I've been Jonesing all day
probably won't.

Indexing jsonb data for pattern matching searches

This is a follow-up to:
Pattern matching on jsonb key/value
I have a table as follows
CREATE TABLE "PreStage".transaction (
transaction_id serial NOT NULL,
transaction jsonb,
CONSTRAINT pk_transaction PRIMARY KEY (transaction_id)
);
The content in my transaction jsonb column looks like
{"ADDR": "abcd", "CITY": "abcd", "PROV": "",
"ADDR2": "",
"ADDR3": "","CNSNT": "Research-NA", "CNTRY": "NL", "EMAIL": "#.com",
"PHONE": "12345", "HCO_NM": "HELLO", "UNQ_ID": "",
"PSTL_CD": "1234", "HCP_SR_NM": "", "HCP_FST_NM": "",
"HCP_MID_NM": ""}
I need a search query like:
SELECT transaction AS data FROM "PreStage".transaction
WHERE transaction->>'HCP_FST_NM' ILIKE '%neer%';
But I need to give my user flexibility to search any key/value on the fly.
An answer to the previous question suggested creating an index like:
CREATE INDEX idxgin ON "PreStage".transaction
USING gin ((transaction->>'HCP_FST_NM') gin_trgm_ops);
That works, but I wanted to index other keys too, hence I was trying something like:
CREATE INDEX idxgin ON "PreStage".transaction USING gin
((transaction->>'HCP_FST_NM'),(transaction->>'HCP_LST_NM') gin_trgm_ops)
That doesn't work. What would be the best indexing approach here, or will I have to create a separate index for each key? In that case the approach would not be generic if a new key/value pair were added to the data.
The syntax error that @jjanes pointed out aside,
for a mix of some popular keys (contained in many rows and / or searched often) plus many more rare keys (contained in few rows and / or rarely searched, new keys might pop up dynamically) I suggest this combination:
Trigram indexes for popular keys
It does not seem like you are going to combine multiple keys in one search often, and a single index with many keys would grow very big and slow. So I would create a separate index for each popular key. Make it a partial index for keys that are not contained in most rows:
CREATE INDEX trans_idxgin_HCP_FST_NM ON transaction -- contained in most rows
USING gin ((transaction->>'HCP_FST_NM') gin_trgm_ops);
CREATE INDEX trans_idxgin_ADDR ON transaction -- not in most rows
USING gin ((transaction->>'ADDR') gin_trgm_ops)
WHERE transaction ? 'ADDR';
Etc. As detailed in my previous answer:
Pattern matching on jsonb key/value
Basic jsonb GIN index
If you have many different keys and / or new keys are added dynamically, you can cover the rest with a basic (default) jsonb_ops GIN index:
CREATE INDEX trans_idxgin ON "PreStage".transaction USING gin (transaction);
Among other things, this supports the search for keys. But you cannot use it for pattern matching on values.
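For instance, key-existence and exact containment queries like these can use that index (a sketch):
SELECT transaction FROM "PreStage".transaction WHERE transaction ? 'ADDR';
SELECT transaction FROM "PreStage".transaction WHERE transaction @> '{"CNTRY": "NL"}';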
What's the proper index for querying structures in arrays in Postgres jsonb?
Query
Combine predicates addressing both indexes:
SELECT transaction AS data
FROM "PreStage".transaction
WHERE transaction->>'HCP_FST_NM' ILIKE '%neer%'
AND transaction ? 'HCP_FST_NM'; -- even if that seems redundant.
The second condition happens to match our partial indexes as well.
So either there is a specific trigram index for the given (popular / common) key, or there is at least an index to find (the few) rows containing the rare key - and then filter for matching values. The same query should give you the best of both worlds.
Be sure to run the latest version of Postgres; there have been various updates to cost estimates recently. It will be crucial that Postgres works with good estimates and current table statistics to choose the best query plan.
There is no built in index that does precisely what you want, searching for an exact key and a corresponding wild-card matching value, without specifying ahead of time which key(s) to use. It should be possible to create an extension which would do this, but it would be an awful lot of work, and I don't know of any that exist.
Your best option that works out of the box might be to cast the jsonb to text and index that text:
create index on transaction using gin ((transaction::text) gin_trgm_ops);
And then add a secondary condition to your query:
SELECT transaction AS data FROM transaction
WHERE transaction->>'HCP_FST_NM' ILIKE '%neer%'
AND transaction::text ilike '%neer%';
Now it can use the index to find anything containing 'neer', and then later re-check that 'neer' occurs in the value for the 'HCP_FST_NM' key, as opposed to just some other place in the JSONB.
If your query word occurs in lots of places other than in the value of the desired key, then this might not give you very good performance. For example, if someone searched for:
transaction->>'EMAIL' ilike '%ADDR%'
AND transaction::text ilike '%ADDR%';
Then the index would return every row (assuming all records have the same structure as what you show), because every row contains 'ADDR' as a key. Every row would then fail the other condition check, but only after doing a lot of work.

How multiple indexes in postgres work on the same column

I'm not really sure how multiple indexes would work on the same column.
So let's say I have an id column and a country column, and on those I have an index on id and another index on id and country. When I look at the query plan, it looks like it's using both of those indexes. I was just wondering how that works? Can I force it to use just the id and country index?
Also, is it bad practice to do that? When is it a good idea to index the same column multiple times?
It is common to have indexes on both (id) and (country,id), or alternatively (country) and (country,id), if you have queries that benefit from each of them. You might also have (id) and (id, country) if you want the "covering" index on (id,country) to support index-only scans, but still need the stand-alone index on (id) to enforce a unique constraint.
In theory you could just have (id,country) and still use it to enforce uniqueness of id, but PostgreSQL does not support that at this time.
You could also sensibly have different indexes on the same column if you need to support different collations or operator classes.
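As a concrete sketch of the first combination (table and index names are illustrative):
-- stand-alone unique index enforcing uniqueness of id
CREATE UNIQUE INDEX orders_id_key ON orders (id);
-- covering index that can also serve index-only scans on (id, country)
CREATE INDEX orders_id_country_idx ON orders (id, country);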
If you want to force PostgreSQL not to use a particular index, to see what happens with it gone, you can drop it in a transaction and then roll it back when done:
BEGIN; drop index table_id_country_idx; explain analyze select * from ....; ROLLBACK;

Indexing strategy for full text search in a multi-tenant PostgreSQL database

I have a PostgreSQL database that stores a contact information table (first, last names) for multiple user accounts. Each contact row has a user id column. What would be the most performant way to set up indexes so that users could search for the first few letters of the first or last name of their contacts?
I'm aware of conventional b-tree indexing and PG-specific GIN and GiST, but I'm just not sure how they could (or could not) work together such that a user with just a few contacts doesn't have to search all of the contacts before filtering by user_id.
You should add the account identifier as the first column of any index you create. This will in effect first narrow down the search to rows belonging to that account. For gist or gin fulltext indexes you will need to install the btree_gist or btree_gin extensions.
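With btree_gin, the full-text variant could look roughly like this (a sketch; it assumes an integer aid column and the firstname/lastname columns used below, and the query has to repeat the indexed expression):
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX contacts_name_fts_idx ON contacts
USING gin (aid, to_tsvector('simple', coalesce(firstname, '') || ' ' || coalesce(lastname, '')));
-- prefix search for one account
SELECT * FROM contacts
WHERE aid = 123
  AND to_tsvector('simple', coalesce(firstname, '') || ' ' || coalesce(lastname, ''))
      @@ to_tsquery('simple', 'an:*');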
If you only need to search for the first letters, the simplest and probably fastest would be to use a regular btree that supports text operations for both columns and do 2 lookups. You'll need to use the text_pattern_ops opclass to support text prefix queries and lower() the fields to ensure case insensitivity:
CREATE INDEX contacts_firstname_idx ON contacts(aid, lower(firstname) text_pattern_ops);
CREATE INDEX contacts_lastname_idx ON contacts(aid, lower(lastname) text_pattern_ops);
The query will then look something like this:
SELECT * FROM contacts WHERE aid = 123 AND
(lower(firstname) LIKE 'an%' OR lower(lastname) LIKE 'an%')