How to properly structure a Multicolumn Index with a partial field search - postgresql

What is the best way to set up a multicolumn index using the full_name column and the state column? The search will use the exact state with a partial search on the full_name column. The query will look like this:
WHERE full_name ~* 'jones' AND state = 'CA';
Searching roughly 20 million records.
Thanks!
John

The state seems straightforward enough -- a normal index should suffice. The full name search is more work, but with 20 million records, I think the dividends will speak for themselves.
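For the state column, by the way, a plain b-tree index is all that's needed; a minimal sketch, keeping the <blah> placeholder used below (the index name is just illustrative):
create index <blah>_state_ix on <blah> (state);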
Create a new field in your table as a tsvector, and call it full_name_search for the sake of this example:
alter table <blah> add column full_name_search tsvector;
Do an initial population of the column:
update <blah>
set full_name_search = to_tsvector (full_name);
If possible, make the field non-nullable.
Create a trigger that will now automatically populate this field whenever it's updated:
create trigger <blah>_insert_update
before insert or update on <blah>
for each row execute procedure
tsvector_update_trigger(full_name_search,'pg_catalog.english',full_name);
Add an index on the new field:
create index <blah>_ix1 on <blah>
using gin(full_name_search);
From here, restructure the query to search on the tsvector field instead of the text field:
WHERE full_name_search @@ to_tsquery('jones') AND state = 'CA';
You can take shortcuts on some of these steps (for example, skip the extra field and use an expression index instead); that will still improve performance, just not as much as the full approach above.
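As a rough sketch of that shortcut (same <blah> placeholder, made-up index name), the expression index and matching query might look like this:
create index <blah>_fts_ix on <blah>
using gin (to_tsvector('english', full_name));
-- the two-argument form is used because the one-argument to_tsvector is not immutable,
-- and the query must repeat the same expression for the planner to consider the index
WHERE to_tsvector('english', full_name) @@ to_tsquery('jones') AND state = 'CA';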
One caveat -- I think to_tsvector will split the contents into lexemes based on logical word breaks, so this:
Catherine Jones Is a Nice Lady
will work fine, but this:
I've been Jonesing all day
probably won't.

Related

fuzzy finding through database - prisma

I am trying to build a storage manager where users can store their lab samples/data. Unfortunately, this means that the tables will end up being quite dynamic, as each sample might have different data associated with it. I will still require users to define a schema, so I can display the data properly; however, I think this schema will have to be represented as a JSON field in the underlying database.
I was wondering: in Prisma, is there a way to fuzzy search through collections? Could I type something like help and then return all rows that match this expression ANYWHERE in their columns (including the JSON fields)? Could I do something like this at all with PostgreSQL? Or with MongoDB?
Thank you
You can easily do that with jsonb in PostgreSQL.
If you have a table defined like
CREATE TABLE userdata (
id bigint PRIMARY KEY,
important_col1 text,
important_col2 integer,
other_cols jsonb
);
You can create an index like this
CREATE INDEX ON userdata USING gin (other_cols);
and search efficiently with
SELECT id FROM userdata WHERE other_cols @> '{"attribute": "value"}';
Here, @> is the JSON containment operator in PostgreSQL.
Yes, in PostgreSQL you surely can do this. It's quite straightforward. Here is an example.
Let your table be called the_table, aliased as tht. Cast an entire table row to text with tht::text and use the case-insensitive regular expression match operator ~* to find rows that contain help anywhere in that text. You can use more elaborate and powerful regular expressions for searching, too.
Please note that since the ~* operator will defeat any index, this query will result in a sequential scan.
select * -- or whatever list of expressions you need
from the_table as tht
where tht::text ~* 'help';

Full-text with partitioning in PostgreSQL

I have a table that I want to search in.
Table:
user_id: integer
text: text
deleted_at: datetime (nullable)
Index:
CREATE INDEX CONCURRENTLY "index_notifications_full_text" ON "notifications"
USING "gist" (to_tsvector('simple'::REGCONFIG, COALESCE(("text")::TEXT, ''::TEXT))) WHERE "deleted_at" IS NULL;
I need to implement a full-text search for users (only inside their messages that are not deleted).
How can I implement an index that indexes both user_id and text?
Using the btree_gin and/or btree_gist extensions, you can include user_id directly in a multicolumn FTS index. Try each index type in turn, as it can be hard to predict which one will be better in a given situation.
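A minimal sketch of the btree_gin variant, reusing the expression from your existing index (the index name and the query shape are just illustrative):
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX CONCURRENTLY "index_notifications_user_full_text" ON "notifications"
USING gin ("user_id", to_tsvector('simple'::REGCONFIG, COALESCE(("text")::TEXT, ''::TEXT)))
WHERE "deleted_at" IS NULL;
-- the query must repeat the same expression so the planner can match the index
SELECT * FROM notifications
WHERE user_id = 123
  AND to_tsvector('simple'::REGCONFIG, COALESCE(("text")::TEXT, ''::TEXT)) @@ to_tsquery('simple', 'foo & bar')
  AND deleted_at IS NULL;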
Alternatively, you could partition the table by user_id using declarative partitioning, and then keep the single-column index (although in that case, GIN is likely better than GiST).
If you want more detailed advice, you need to give us more details, like how many user_id values there are, how many notifications per user_id, how many tokens per notification, and an example of a plausible query you hope to support efficiently.
You can add a new column, e.g. document_with_idx, of type tsvector to your notifications table:
ALTER TABLE notifications ADD COLUMN document_with_idx tsvector;
Then update that column with the vectorized value of the user_id and text columns:
update notifications set document_with_idx = to_tsvector(user_id || ' ' || coalesce(text, ''));
Finally, create an index, e.g. document_idx, on that column:
CREATE INDEX document_idx
ON notifications
USING GIN (document_with_idx);
Now you can do a full-text search on both the user_id and text values through that document_with_idx column. Search like this:
select user_id, text from notifications
where document_with_idx @@ to_tsquery('your search string goes here');
See more: https://www.postgresql.org/docs/9.5/textsearch-tables.html
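One gap in this approach is that document_with_idx goes stale as rows change. A hedged sketch of a trigger to keep it updated (the function and trigger names are made up):
create function notifications_tsv_refresh() returns trigger as $$
begin
  -- recompute the combined tsvector from user_id and text
  new.document_with_idx := to_tsvector(new.user_id || ' ' || coalesce(new.text, ''));
  return new;
end;
$$ language plpgsql;

create trigger notifications_tsv_refresh_trg
before insert or update on notifications
for each row execute procedure notifications_tsv_refresh();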

Fast way to check if PostgreSQL jsonb column contains certain string

The past two days I've been reading a lot about jsonb, full-text search, GIN indexes, trigram indexes and whatnot, but I still cannot find a definitive (or at least good enough) answer on how to quickly check whether a JSONB column contains a certain string as a value. Since it's a search functionality, the behavior should be like that of ILIKE.
What I have is:
A table, let's call it app.table_1, which contains a lot of columns, one of which is of type JSONB; let's call it column_jsonb.
The data inside column_jsonb will always be flat (no nested objects, etc.) but the keys can vary. An example of the data in the column with obfuscated values looks like this:
"{""Key1"": ""Value1"", ""Key2"": ""Value2"", ""Key3"": null, ""Key4"": ""Value4"", ""Key5"": ""Value5""}"
I have a GIN index for this column which doesn't seem to affect the search time significantly (I am testing with 20k records now, which takes about 550 ms). The index looks like this:
CREATE INDEX ix_table_1_column_jsonb_gin
ON app.table_1 USING gin
(column_jsonb jsonb_path_ops)
TABLESPACE pg_default;
I am interested only in the VALUES and the way I am searching them now is this:
EXISTS(SELECT value FROM jsonb_each(column_jsonb) WHERE value::text ILIKE search_term)
Here search_term is a variable coming from the front end with the string that the user is searching for.
I have the following questions:
Is it possible to make the check faster without modifying the data model? I've read that a trigram index might be useful for similar cases, but at least to me it seems that converting jsonb to text and then checking will be slower, and I am not sure whether a trigram index will even work if the column's original type is JSONB and I explicitly cast each row to text. If I'm wrong I would really appreciate some explanation, with an example if possible.
Is there some JSONB function that I am not aware of which offers what I am searching for out of the box? I'm constrained to PostgreSQL v11.9, so some new things coming with version 12 are not available to me.
If it's not possible to achieve a significant improvement with the current data structure, can you propose a way to restructure the data in column_jsonb -- maybe another column of some other type with the data persisted in some other way, I don't know...
Thank you very much in advance!
If the data structure is flat, and you regularly need to search the values, and the values are all the same type, a traditional key/value table would seem more appropriate.
create table table1_options (
table1_id bigint not null references table1(id),
key text not null,
value text not null
);
create index table1_options_key on table1_options(key);
create index table1_options_value on table1_options(value);
select *
from table1_options
where value ilike 'some search%';
I've used simple B-Tree indexes, but you can use whatever you need to speed up your particular searches.
The downsides are that all values must have the same type (doesn't seem to be a problem here) and you need an extra table for each table. That last one can be mitigated somewhat with table inheritance.
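If you go this route, a rough sketch of populating the key/value table from the existing jsonb column (assuming the question's app.table_1 has an id primary key, as table1 does in the example above):
insert into table1_options (table1_id, key, value)
select t.id, kv.key, kv.value
from app.table_1 as t
cross join lateral jsonb_each_text(t.column_jsonb) as kv
where kv.value is not null;  -- the options table requires non-null values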

GIST Index Expression based on Geography Type Column Problems

I have a question about how PostgreSQL uses indexes. I have problems with a GiST expression index based on a geography type column, in a PostgreSQL database with PostGIS enabled.
I have the following table:
CREATE TABLE place
(
id serial NOT NULL,
name character varying(40) NOT NULL,
location geography(Point,4326),
CONSTRAINT place_pkey PRIMARY KEY (id)
)
Then I created a GiST expression index based on the "location" column:
CREATE INDEX place_buffer_5000m ON place
USING GIST (ST_BUFFER(location, 5000));
Now suppose that in table route I have a column shape with a LineString object, and I want to check which 5000 m polygons (around the location) the line crosses.
The query below in my opinion should use the "place_buffer_5000m" index but does not use it.
SELECT place.name
FROM place, route
WHERE
route.id=1 AND
ST_CROSSES(route.shape::geometry, ST_BUFFER(place.location, 5000)::geometry);
Table place has about 76,000 rows. ANALYZE and VACUUM were run on this table, along with recreating the "place_buffer_5000m" index, but the index is not used for the above query.
What is funny: when I create another column in table place named "area_5000m" (geography type) and update the table like:
UPDATE place SET area_5000m=ST_BUFFER(location, 5000)
And then create a GiST index for this column like this:
CREATE INDEX place_area_5000m ON place USING GIST (area_5000m)
Then using the query:
SELECT place.name
FROM place, route
WHERE
route.id=1 AND
ST_CROSSES(route.shape::geometry, place.area_5000m::geometry);
The index "place_area_5000m" is used.
The question is: why is the expression index calculated from the location column not used?
Did you try to add a cast to your "functional index"?
This could help to determine the data type.
It should work with geometry and probably also for geography, like this:
CREATE INDEX place_buffer_5000m ON place
USING GIST(ST_BUFFER(location, 5000)::geometry);
Ultimately, you want to know what routes are within 5 km of places, which is a really simple and common type of query. However, you are falling into a common trap: don't use ST_Buffer to filter! It is expensive!
Use ST_DWithin, which will use a regular GiST index (if available):
SELECT place.name
FROM place, route
WHERE route.id = 1 AND ST_DWithin(route.shape::geography, place.location, 5000);
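For that query to use an index, there needs to be a plain GiST index on the geography column; a minimal sketch (the index name is made up):
CREATE INDEX place_location_gix ON place USING GIST (location);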

Optimize getting counts of rows grouped by first letter in SQLite?

My current query looks something like this:
SELECT SUBSTR(name,1,1), COUNT(*) FROM files GROUP BY SUBSTR(name,1,1)
But it's taking a pretty long time just to do counts on a table that's already indexed by the name column. I saw from this question that some engines might not use indexes correctly for the SUBSTR function, and in fact, SQLite will not use indexes for SUBSTR(string,1,1).
Is there any other approach that would utilize the index and net me some faster queries?
One strategy that is consistent with your access pattern is to add a new indexed column "first_letter" to your table. Use a trigger to set the value on insert and update. Then your query is a simple group by first_letter.
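A rough sketch of that approach (column, index, and trigger names are made up; this assumes files is an ordinary rowid table):
ALTER TABLE files ADD COLUMN first_letter TEXT;
UPDATE files SET first_letter = SUBSTR(name, 1, 1);
CREATE INDEX files_first_letter_idx ON files(first_letter);

-- keep the column populated going forward
CREATE TRIGGER files_first_letter_ins AFTER INSERT ON files
BEGIN
  UPDATE files SET first_letter = SUBSTR(NEW.name, 1, 1) WHERE rowid = NEW.rowid;
END;

CREATE TRIGGER files_first_letter_upd AFTER UPDATE OF name ON files
BEGIN
  UPDATE files SET first_letter = SUBSTR(NEW.name, 1, 1) WHERE rowid = NEW.rowid;
END;

-- the count query then becomes
SELECT first_letter, COUNT(*) FROM files GROUP BY first_letter;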
Another strategy is to create a shadow table which contains an aggregation of the mother table. This isn't easy, because it is your job as the developer to keep the shadow table consistent with the mother table: every delete, update or insert on the files table needs to be accompanied by a change in the shadow table.
Databases like Oracle have support for materialized views to achieve this automatically but sqlite doesn't.
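A hedged sketch of such a shadow table maintained by triggers (all names are made up; an UPDATE OF name trigger would also be needed but is omitted for brevity):
CREATE TABLE files_letter_counts (
  first_letter TEXT PRIMARY KEY,
  cnt INTEGER NOT NULL
);

-- seed it from the existing data
INSERT INTO files_letter_counts (first_letter, cnt)
SELECT SUBSTR(name, 1, 1), COUNT(*) FROM files GROUP BY SUBSTR(name, 1, 1);

CREATE TRIGGER files_counts_ins AFTER INSERT ON files
BEGIN
  INSERT OR IGNORE INTO files_letter_counts (first_letter, cnt)
  VALUES (SUBSTR(NEW.name, 1, 1), 0);
  UPDATE files_letter_counts SET cnt = cnt + 1
  WHERE first_letter = SUBSTR(NEW.name, 1, 1);
END;

CREATE TRIGGER files_counts_del AFTER DELETE ON files
BEGIN
  UPDATE files_letter_counts SET cnt = cnt - 1
  WHERE first_letter = SUBSTR(OLD.name, 1, 1);
END;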