I need to store a few hundred thousand HTML documents in a database and be able to search them. But not just for content: I need the searches to match class names, script names and id values (among other things) that might appear as attributes within the HTML tags in the documents. I tried using to_tsvector('english', tableColumn) and to_tsvector('simple', tableColumn), but neither seems to match the contents of attributes in tags. Specifically, I did this:
create index an_index on myTable using gin (to_tsvector('simple',tableColumn))
and then:
select url from myTable where to_tsvector('simple', tableColumn) @@ to_tsquery('myscript.js')
I expected it to retrieve all documents that contained a reference to myscript.js. But it returns no results.
Is it possible to achieve the results I want using the full-text search?
Thanks in advance for your help.
Try this instead:
SELECT url FROM myTable WHERE tableColumn @@ to_tsquery('simple', 'myscript.js')
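If that still returns nothing, it may help to look at how the text search parser tokenizes a tag before changing the query further. This is only a minimal sketch, assuming the default parser and the 'simple' configuration; the HTML fragment is made up for illustration:

```sql
-- Show the token type (alias), the raw token and the resulting lexemes
-- for a made-up HTML fragment.
SELECT alias, token, lexemes
FROM ts_debug('simple', '<script src="myscript.js"></script> some text');
```

Token types that have no dictionary mapping in the configuration are dropped by to_tsvector. So if a whole tag comes back as a single token of such a type, its attribute values never reach the tsvector, and no tsquery can match them.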
Related
I'm trying to write a query in sql to exclude a keyword:
It's a list of cities written out (e.g. AnnArbor-MI). In the list there are duplicates because some have the word 'badsetup' after the city and these need to be discarded. How would I write something to exclude any city with 'badsetup' after it?
Your question title and content appear to be asking for two different things ...
Query cities while excluding the trailing 'badsetup':
SELECT regexp_matches(citycolumn, '(.*)badsetup')
FROM mytable;
Query cities that don't have the trailing 'badsetup':
SELECT citycolumn
FROM mytable
WHERE citycolumn NOT LIKE '%badsetup';
In PostgreSQL, to select rows excluding those that contain the word 'badsetup', you can use the following:
SELECT * FROM table_name WHERE column_name NOT LIKE '%badsetup%';
Here '%' matches any sequence of zero or more characters. So the pattern matches any value that contains 'badsetup' anywhere in your column, regardless of the characters before or after it, and NOT LIKE excludes those rows.
You can find more information in section 9.7.1 here: https://www.postgresql.org/docs/8.3/static/functions-matching.html
I'm trying to create a PostgreSQL query to find a partial text inside a tsvector column.
I have a tsvector value like this "'89' 'TT7' 'test123'" and I need to find any rows that contain "%es%".
How can I do that?
I tried
select * from use_docs_conteudo
WHERE textodados @@ to_tsquery('es')
It looks like you want fast ILIKE-style queries for wildcard matching. pg_trgm is the right tool for that. You can also use POSIX regular expressions to define your query.
WITH data(t) AS ( VALUES
('test123! TT7 89'::TEXT),
('test123, TT7 89'::TEXT),
('test@test123.domain TT7 89'::TEXT)
)
SELECT count(*) FROM data WHERE t ~* 'es' AND t ~* '\mtest123\M';
Result:
count
-------
3
(1 row)
Links for existing answers:
Postgresql full text search part of words
PostgreSQL: Full Text Search - How to search partial words?
I have a posts table with a tags column. I'd like to be able to do full text search across the tags. For VARCHAR columns I've used:
CREATE INDEX posts_fts_idx ON posts USING gin(to_tsvector('english', coalesce(title, '')));
SELECT "posts".* FROM "posts" WHERE (to_tsvector('english', coalesce(title, '')) @@ (to_tsquery('english', 'ruby')));
However, for character varying[] the function to_tsvector does not exist. How can a query be written that will run against each of the tags (ideally matching if any single tag matches)?
Note: I see that it would be pretty easy to do a conversion to a string (array_to_string) but if possible I'd like to convert each individual tag to a tsvector.
You could index the character varying using gin for search options. Try this :
CREATE INDEX idx_post_tag ON posts USING GIN(tags);
SELECT * FROM posts WHERE tags @> (ARRAY['search string'::character varying]);
This works when an exact match is desired. If an exact match is not desired, you should consider storing your tags in a text column. Think about what these 'tags' really mean to you. String array types lack text indexing, stemming and inflection support, so you won't be able to match word forms such as 'Dancing' with 'Dance'.
If that is not an option, you could circumvent this with an immutable version of array_to_string function. Your queries would then be :
CREATE INDEX posts_fts_idx ON posts USING gin(to_tsvector('english', immutable_array_to_string(tags, ' ')));
SELECT "posts".* FROM "posts" WHERE (to_tsvector('english', immutable_array_to_string(tags, ' ')) @@ (to_tsquery('english', 'ruby')));
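Note that immutable_array_to_string is not a built-in function; it has to be defined before the index can be created. A minimal sketch of such a wrapper, assuming the tags contain plain strings (the parameter names are made up, and the IMMUTABLE label is a promise you make to PostgreSQL, not something it verifies):

```sql
-- Hypothetical wrapper: array_to_string() itself is not marked IMMUTABLE,
-- so it cannot be used directly in an index expression. For arrays of
-- plain strings the result depends only on the arguments, so declaring
-- the wrapper IMMUTABLE is a reasonable assumption.
CREATE OR REPLACE FUNCTION immutable_array_to_string(arr text[], sep text)
RETURNS text
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT array_to_string(arr, sep);
$$;
```

The query has to use exactly the same expression as the index definition, otherwise the planner will not consider the index.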
I know about full-text search, but that only matches your query against individual words. I want to select strings that contain a word that starts with words in my query. For example, if I search:
appl
the following should match:
a really nice application
apples are cool
appliances
since all those strings contain words that start with appl. In addition, it would be nice if I could select the number of words that match, and sort based on that.
How can I implement this in PostgreSQL?
Prefix matching with Full Text Search
FTS supports prefix matching. Your query works like this:
SELECT * FROM tbl
WHERE to_tsvector('simple', string) @@ to_tsquery('simple', 'appl:*');
Note the appended :* in the tsquery. This can use an index.
See:
Get partial match from GIN indexed TSVECTOR column
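For the prefix query to use an index, the index has to be built on the same expression that appears in the WHERE clause. A minimal sketch, assuming the table tbl and column string from the query above (the index name is made up):

```sql
-- Expression index matching to_tsvector('simple', string) in the query.
CREATE INDEX tbl_string_fts_idx ON tbl
USING gin (to_tsvector('simple', string));

-- Check the plan to see whether the index is actually chosen.
EXPLAIN
SELECT * FROM tbl
WHERE to_tsvector('simple', string) @@ to_tsquery('simple', 'appl:*');
```

On a small table the planner may still prefer a sequential scan, so test with a realistic amount of data.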
Alternative with regular expressions
SELECT * FROM tbl
WHERE string ~ '\mappl';
Quoting the manual here:
\m .. matches only at the beginning of a word
To order by the count of matches, you could use regexp_matches():
SELECT tbl_id, count(*) AS matches
FROM (
SELECT tbl_id, regexp_matches(string, '\mappl', 'g')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY tbl_id
ORDER BY matches DESC;
Or regexp_split_to_table():
SELECT tbl_id, string, count(*) - 1 AS matches
FROM (
SELECT tbl_id, string, regexp_split_to_table(string, '\mappl')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY 1, 2
ORDER BY 3 DESC, 2, 1;
Postgres 9.3 or later has index support for simple regular expressions with a trigram GIN or GiST index. The release notes for Postgres 9.3:
Add support for indexing of regular-expression searches in pg_trgm
(Alexander Korotkov)
See:
PostgreSQL LIKE query performance variations
Depesz wrote a blog post about index support for regular expressions.
SELECT * FROM some_table WHERE some_field LIKE 'appl%' OR some_field LIKE '% appl%';
As for counting the number of words that match, I believe that would be too expensive to do dynamically in Postgres (though maybe someone else knows better). One way you could do it is by writing a function that counts occurrences in a string, and then adding ORDER BY myFunction('appl', some_field). Again though, this method is VERY expensive (i.e. slow) and not recommended.
For things like that, you should probably use a separate, complementary full-text search engine like Sphinx Search (google it), which is specialized for that sort of thing.
An alternative is to have another table that contains keywords and the number of occurrences of those keywords in each string. This means you need to store each phrase you have (e.g. really really nice application) and also store the keywords in another table (i.e. really, 2, nice, 1, application, 1) and link that keyword table to your full-phrase table. You would have to break up strings into keywords as they are entered into your database and store them in two places. This is a typical space vs. speed trade-off.
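The keyword-table idea could look roughly like this. It is only a sketch: the table, column and index names are made up, and splitting phrases into keywords on insert is left to the application or a trigger.

```sql
-- Hypothetical schema: one row per phrase, one row per distinct keyword in it.
CREATE TABLE phrases (
    phrase_id serial PRIMARY KEY,
    phrase    text NOT NULL
);

CREATE TABLE phrase_keywords (
    phrase_id   integer NOT NULL REFERENCES phrases (phrase_id),
    keyword     text    NOT NULL,
    occurrences integer NOT NULL,
    PRIMARY KEY (phrase_id, keyword)
);

-- text_pattern_ops lets a B-tree index serve left-anchored LIKE
-- patterns regardless of the database locale.
CREATE INDEX phrase_keywords_keyword_idx
    ON phrase_keywords (keyword text_pattern_ops);

-- Phrases containing words that start with 'appl', most matches first.
SELECT p.phrase_id, p.phrase, sum(k.occurrences) AS matches
FROM phrase_keywords k
JOIN phrases p ON p.phrase_id = k.phrase_id
WHERE k.keyword LIKE 'appl%'
GROUP BY p.phrase_id, p.phrase
ORDER BY matches DESC;
```

For the example phrase above, phrase_keywords would hold the rows (really, 2), (nice, 1) and (application, 1).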
I have a query with a number of text fields, something like this:
SELECT * FROM some_table
WHERE field1 ILIKE '%thing%'
OR field2 ILIKE '%thing'
OR field3 ILIKE '%thing';
The columns are pretty much all varchar(50) or thereabouts. Now I understand to improve performance I should index the fields upon which the search operates. Should I be considering replacing ILIKE with TSEARCH completely?
A full text search setup is not identical to a "contains"-style LIKE query. Among other things, it stems words, so you can match "cars" against "car".
If you really want a fast ILIKE then no standard database index or FTS will help. Fortunately, the pg_trgm module can do that.
http://www.postgresql.org/docs/9.1/static/pgtrgm.html
http://www.depesz.com/2011/02/19/waiting-for-9-1-faster-likeilike/
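A minimal sketch of the pg_trgm approach, assuming PostgreSQL 9.1 or later and the table and column names from the question (the index name is made up):

```sql
-- pg_trgm ships with PostgreSQL as a contrib module.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index; gin_trgm_ops supports LIKE and ILIKE,
-- including patterns with a leading wildcard.
CREATE INDEX some_table_field1_trgm_idx
    ON some_table USING gin (field1 gin_trgm_ops);

-- This predicate can now be answered with the trigram index.
SELECT * FROM some_table
WHERE field1 ILIKE '%thing%';
```

Each column searched this way needs its own trigram index, and very short search terms (fewer than three characters) give the index little to work with.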
One thing that is very important: NO B-TREE INDEX will ever improve this kind of search:
where field ilike '%SOMETHING%'
What I am saying is that if you do a:
create index idx_name on some_table(field);
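The only access this improves is where field like 'something%', i.e. when you search for values starting with some literal (and, unless the database uses the C locale, only if the index is created with text_pattern_ops or varchar_pattern_ops). So you get no benefit from adding a regular index on the field column in this case.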
If you need to improve your search response time, definitely consider using FULL TEXT SEARCH.
Adding a bit to what the others have said.
First you can't really use an index based on a value in the middle of the string. Indexes are tree searches generally, and you have no way to know if your search will be faster than just scanning the table, so PostgreSQL will default to a seq scan. Indexes will only be used if they match the first part of the string. So:
SELECT * FROM invoice
WHERE invoice_number like 'INV-2012-435%'
may use an index but like '%44354456%' cannot.
In general in LedgerSMB we use both, depending on what kind of search we are doing. You might see a search like:
select * from parts
WHERE partnumber ilike ? || '%'
and plainto_tsquery(get_default_language(), ?) @@ description;
So these are very different. Use each one where it makes the most sense.