tsearch2 word statistics - postgresql

I have a column with a tsvector in my table. Now I'd like to find out which words are represented above average, so I can add these to the stop word list. Is there a function in tsearch2 to list the frequency of all words in the index?

SELECT * FROM ts_stat('SELECT ts_vector_col FROM mytable')
ORDER BY nentry DESC, ndoc DESC, word ;
See the manual section "Gathering Document Statistics".
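To single out the words that are represented above average (the stop-word candidates), the ts_stat() result can be compared against its own average. A minimal sketch, assuming the column is called ts_vector_col as in the query above:
WITH stats AS (
    SELECT word, ndoc, nentry
    FROM ts_stat('SELECT ts_vector_col FROM mytable')
)
SELECT word, ndoc, nentry
FROM stats
WHERE nentry > (SELECT avg(nentry) FROM stats)   -- keep only above-average words
ORDER BY nentry DESC, ndoc DESC, word;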

Related

If I call the same postgres function with the same arguments twice in a query, does it calculate once or twice?

If I make the following query in postgres, is it calculating the ts_rank twice or just once? If it is calculating it twice, is it possible to make it calculate it only once?
SELECT id, name, "createdAt", price, ts_rank(document, to_tsquery(:query)) AS rank
FROM search_index
WHERE document @@ to_tsquery(:query)
ORDER BY ts_rank(document, to_tsquery(:query)) DESC;
In this case, it should be calculated only once; Postgres detects equal expressions. Generally, if you are worried about this, you can calculate the expression in a subquery.
Something like:
SELECT c1, c1 FROM (SELECT exp AS c1) s;
The function to_tsquery() is very expensive without a full-text index. If you have a full-text index and there are only a few hundred selected rows, the overhead of ts_rank() should not be significant.
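Applied to the query from the question, the subquery pattern would look something like this (a sketch; :query is the same bind parameter as above):
SELECT id, name, "createdAt", price, rank
FROM (
    SELECT id, name, "createdAt", price,
           ts_rank(document, to_tsquery(:query)) AS rank   -- computed once here
    FROM search_index
    WHERE document @@ to_tsquery(:query)
) s
ORDER BY rank DESC;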

Counting the Number of Occurrences of a Multi-Word Phrase in Text with PostgreSQL

I have a problem, I need to count the frequency of a word phrase appearing within a text field in a PostgreSQL database.
I'm aware of functions such as to_tsquery(), and I'm using to_tsquery('simple', 'sample text') to check whether the phrase exists within the text; however, I'm unsure how to count these occurrences accurately.
If the phrase is contained just once in the string (I am assuming here that your table has two columns, an id and a text column called my_text):
SELECT count(id)
FROM my_table
WHERE my_text ~* 'the_words_i_am_looking_for';
If the occurrences are more than one per field, this nested query can be used:
SELECT id, count(matches) AS matches
FROM (
    SELECT id,
           regexp_matches(my_text, 'the_words_i_am_looking_for', 'g') AS matches
    FROM my_table
) t
GROUP BY 1;
The syntax of this function and much more about string pattern matching can be found in the PostgreSQL documentation on pattern matching.
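To get a single total across the whole table rather than per-row counts, the per-row matches can simply be counted together. A sketch, reusing my_table/my_text and the 'sample text' phrase from the question ('g' finds all matches, 'i' makes the match case-insensitive; a function call in FROM like this needs Postgres 9.3 or later):
SELECT count(*) AS total_occurrences
FROM my_table,
     regexp_matches(my_text, 'sample text', 'gi');   -- one row per occurrence, across all rows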

Count all rows containing [Word] in [Column] in postgresql

I need to count and also get output of all rows in a table containing a given word in a specific column. Something like
ID Name Fave
678 Adam cannot bear this
355 Baloo bear is a bear
245 Cheeta runs fast
So that I can get an output of '2' (and not '3') on counting the number of rows containing the word 'bear' in the column 'Fave', and an output of the first two rows for the tabular output/select rows.
I've tried
SELECT * WHERE regexp_matches(Fave, 'bear') FROM table_name
but I'm getting a syntax error near FROM, so I'm guessing WHERE is where the trouble is. Any pointers/help, please?
Are you looking for:
SELECT * FROM table_name WHERE Fave like '%bear%'
The FROM goes before the WHERE. Also, regexp_matches() returns a set, so it cannot be used directly as a WHERE condition; use the regular expression match operator instead:
SELECT *
FROM table_name
WHERE Fave ~ 'bear';
You can also use LIKE, of course, but the issue is the order of the clauses in the query.
select * from table_name where Fave ~* '\mbear\M';
~* - case-insensitive regexp matches
'\m...\M' - matches a whole word, so 'taddy bear' matches and 'taddybear' does not.
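If both the matching rows and their count are wanted from one query, a window function can be bolted onto the same predicate; a sketch:
SELECT *, count(*) OVER () AS total_matches   -- the same total repeated on every matching row
FROM table_name
WHERE Fave ~* '\mbear\M';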

Postgresql tsvector structure

Hello.
I am trying to use tsvector to count the frequencies of terms.
I think I am almost there, but I cannot find a way to obtain the terms from the tsvector structure.
What I have done, after creating the tsvector column, is:
SELECT term_tsv, count(*) AS count
FROM (SELECT unnest(term_tsv) AS term_tsv FROM document_tsv) t
GROUP BY term_tsv
ORDER BY count DESC;
the result is like:
   term_tsv    | count
---------------+-------
 (3,{9},{D})   |     1
I am lost because I do not know what kind of expression the parentheses represent.
Can anybody tell me how to extract the term from the shell?
Thank you.
I figured out that something like the following, which is given in the official manual, lists the top 10 most frequent entries:
SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
Just for the record.
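As for the parenthesized value: unnest() applied to a tsvector returns rows of a composite type with the fields lexeme, positions and weights, so (3,{9},{D}) is the lexeme '3' at position 9 with weight D. On Postgres 9.6 or later the fields can be selected individually; a sketch, assuming the document_tsv table and term_tsv column from the question:
SELECT u.lexeme, u.positions, u.weights
FROM document_tsv,
     unnest(term_tsv) AS u;   -- one row per lexeme in each tsvector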

Searching individual words in a string

I know about full-text search, but that only matches your query against individual words. I want to select strings that contain a word that starts with words in my query. For example, if I search:
appl
the following should match:
a really nice application
apples are cool
appliances
since all those strings contain words that start with appl. In addition, it would be nice if I could select the number of words that match, and sort based on that.
How can I implement this in PostgreSQL?
Prefix matching with Full Text Search
FTS supports prefix matching. Your query works like this:
SELECT * FROM tbl
WHERE to_tsvector('simple', string) @@ to_tsquery('simple', 'appl:*');
Note the appended :* in the tsquery. This can use an index.
See:
Get partial match from GIN indexed TSVECTOR column
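A minimal sketch of such an index, assuming the tbl/string names from the query above (the index name is arbitrary):
CREATE INDEX tbl_string_fts_idx ON tbl
USING gin (to_tsvector('simple', string));   -- must match the expression and configuration used in the query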
Alternative with regular expressions
SELECT * FROM tbl
WHERE string ~ '\mappl';
Quoting the manual:
\m .. matches only at the beginning of a word
To order by the count of matches, you could use regexp_matches()
SELECT tbl_id, count(*) AS matches
FROM (
SELECT tbl_id, regexp_matches(string, '\mappl', 'g')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY tbl_id
ORDER BY matches DESC;
Or regexp_split_to_table():
SELECT tbl_id, string, count(*) - 1 AS matches
FROM (
SELECT tbl_id, string, regexp_split_to_table(string, '\mappl')
FROM tbl
WHERE string ~ '\mappl'
) sub
GROUP BY 1, 2
ORDER BY 3 DESC, 2, 1;
Postgres 9.3 or later has index support for simple regular expressions with a trigram GIN or GiST index. The release notes for Postgres 9.3:
Add support for indexing of regular-expression searches in pg_trgm
(Alexander Korotkov)
See:
PostgreSQL LIKE query performance variations
Depesz wrote a blog about index support for regular expressions.
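A minimal sketch of the corresponding trigram index (the index name is again arbitrary):
CREATE EXTENSION IF NOT EXISTS pg_trgm;   -- provides the gin_trgm_ops operator class
CREATE INDEX tbl_string_trgm_idx ON tbl
USING gin (string gin_trgm_ops);          -- supports LIKE, ~ and ~* pattern searches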
SELECT * FROM some_table WHERE some_field LIKE 'appl%' OR some_field LIKE '% appl%';
As for counting the number of words that match, I believe that would be too expensive to do dynamically in Postgres (though maybe someone else knows better). One way you could do it is by writing a function that counts occurrences in a string, and then adding ORDER BY myFunction('appl', some_field). Again though, this method is VERY expensive (i.e. slow) and not recommended.
For things like that, you should probably use a separate/complementary full-text search engine like Sphinx Search, which is specialized for that sort of thing.
An alternative to that, is to have another table that contains keywords and the number of occurrences of those keywords in each string. This means you need to store each phrase you have (e.g. really really nice application) and also store the keywords in another table (i.e. really, 2, nice, 1, application, 1) and link that keyword table to your full-phrase table. This means that you would have to break up strings into keywords as they are entered into your database and store them in two places. This is a typical space vs speed trade-off.
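A rough sketch of that keyword-table layout (all names here are hypothetical):
CREATE TABLE phrase (
    phrase_id serial PRIMARY KEY,
    phrase    text NOT NULL                  -- e.g. 'really really nice application'
);

CREATE TABLE phrase_keyword (
    phrase_id   int  NOT NULL REFERENCES phrase (phrase_id),
    keyword     text NOT NULL,               -- e.g. 'really'
    occurrences int  NOT NULL,               -- e.g. 2
    PRIMARY KEY (phrase_id, keyword)
);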