postgres full text search word count

postgres full text search word count - postgresql

I've been messing around with full text search in postgres, and I was wondering, is it possible to return total word counts of all rows?
So, let's say you have
text_col
_______
'dog'
'dog cat'
'dog bird dog'
the count of 'dog' should be four, the count of 'cat' should be one and bird should also be one.
Right now I have all the tsvectors saved to a gin indexed column as well.
Of course this would be across all rows where you could say something like
select max(ts_count(text_col_tsvector)) from mytable;
(I made that up, but I hope you get the general idea)
is it only possible to return the count of the lexeme, and if so, how does one return the dictionary (or array) of the lexeme returned.

how about:
select * from ts_stat('select text_col_tsvector from mytable')
Edit:
You mean:
with words as (
select regexp_split_to_table(text_column , E'\\W+') as word
from mytable
)
select word, count(*) as cnt from words group by 1 order by 2 desc
?

Related

PostgreSQL: json object where keys are unique array elements and values are the count of times they appear in the array

I have an array of strings, some of which may be repeated. I am trying to build a query which returns a single json object where the keys are the distinct values in the array, and the values are the count of times each value appears in the array.
I have built the following query;
WITH items (item) as (SELECT UNNEST(ARRAY['a','b','c','a','a','a','c']))
SELECT json_object_agg(distinct_values, counts) item_counts
FROM (
SELECT
sub2.distinct_values,
count(items.item) counts
FROM (
SELECT DISTINCT items.item AS distinct_values
FROM items
) sub2
JOIN items ON items.item = sub2.distinct_values
GROUP BY sub2.distinct_values, items.item
) sub1
DbFiddle
Which provides the result I'm looking for: { "a" : 4, "b" : 1, "c" : 2 }
However, it feels like there's probably a better / more elegant / less verbose way of achieving the same thing, so I wondered if any one could point me in the right direction.
For context, I would like to use this as part of a bigger more complex query, but I didn't want to complicate the question with irrelevant details. The array of strings is what one column of the query currently returns, and I would like to convert it into this JSON blob. If it's easier and quicker to do it in code then I can, but I wanted to see if there was an easy way to do it in postgres first.

I think a CTE and json_object_agg() is a little bit of a shortcut to get you there?
WITH counter AS (
SELECT UNNEST(ARRAY['a','b','c','a','a','a','c']) AS item, COUNT(*) AS item_count
GROUP BY 1
ORDER BY 1
)
SELECT json_object_agg(item, item_count) FROM counter
Output:
{"a":4,"b":1,"c":2}

Comparing two text array columns using % and ordering by number of matches

I have a user table that contains a "skills" column which is a text array. Given some input array, I would like to find all the users whose skills % one or more of the entries in the input array, and order by number of matches (according to the % operator from pg_trgm).
For example, I have Array['java', 'ruby', 'postgres'] and I want users who have these skills ordered by the number of matches (max is 3 in this case).
I tried unnest() with an inner join. It looked like I was getting somewhere, but I still have no idea how I can capture the count of the matching array entries. Any ideas on what the structure of the query may look like?
Edit: Details:
Here is what my programmers table looks like:
id | skills
----+-------------------------------
1 | {javascript,rails,css}
2 | {java,"ruby on rails",adobe}
3 | {typescript,nodejs,expressjs}
4 | {auth0,c++,redis}
where skills is a text array.
Here is what I have so far:
SELECT * FROM programmers, unnest(skills) skill_array(x)
INNER JOIN unnest(Array['ruby', 'node']) search(y)
ON skill_array.x % search.y;
which outputs the following:
id | skills | x | y
----+-------------------------------+---------------+---------
2 | {java,"ruby on rails",adobe} | ruby on rails | ruby
3 | {typescript,nodejs,expressjs} | nodejs | node
3 | {typescript,nodejs,expressjs} | expressjs | express
*Assuming pg_trgm is enabled.

For an exact match between the user skills and the searched skills, you can proceed like this :
You put the searched skills in the target_skills text array
You filter the users from the table user_table whose user_skills array has at least one common element with the target_skills array by using the && operator
For each of the selected users, you select the common skills by using unnest and INTERSECT, and you calculate the number of these common skills
You order the result by the number of common skills DESC
In this process, the users with skill "ruby" will be selected for the target skill "ruby", but not the users with skill "ruby on rails".
This process can be implemented as follow :
SELECT u.user_id
, u.user_skills
, inter.skills
FROM user_table AS u
CROSS JOIN LATERAL
( SELECT array( SELECT unnest(u.user_skills)
INTERSECT
SELECT unnest(target_skills)
) AS skills
) AS inter
WHERE u.user_skills && target_skills
ORDER BY array_length(inter.skills, 1) DESC
or with this variant :
SELECT u.user_id
, u.user_skills
, array_agg(t_skill) AS inter_skills
FROM user_table AS u
CROSS JOIN LATERAL unnest(target_skills) AS t_skill
WHERE u.user_skills && array[t_skill]
GROUP BY u.user_id, u.user_skills
ORDER BY array_length(inter_skills, 1) DESC
This query can be accelerated by creating a GIN index on the user_skills column of the user_table.
For a partial match between the user skills and the target skills (ie the users with skill "ruby on rails" must be selected for the target skill "ruby"), you need to use the pattern matching operator LIKE or the regular expression, but it is not possible to use them with text arrays, so you need first to transform your user_skills text array into a simple text with the function array_to_string. The query becomes :
SELECT u.user_id
, u.user_skills
, array_agg(t_skill) AS inter_skills
FROM user_table AS u
CROSS JOIN unnest(target_skills) AS t_skill
WHERE array_to_string(u.user_skills, ' ') ~ t_skill
GROUP BY u.user_id, u.user_skills
ORDER BY array_length(inter_skills, 1) DESC ;
Then you can accelerate the queries by creating the following GIN (or GiST) index :
DROP INDEX IF EXISTS user_skills ;
CREATE INDEX user_skills
ON user_table
USING gist (array_to_string(user_skills, ' ') gist_trgm_ops) ; -- gin_trgm_ops and gist_trgm_ops indexes are compliant with the LIKE operator and the regular expressions
In any case, managing the skills as text will ever fail if there are typing errors or if the skills list is not normalized.

I accepted Edouard's answer, but I thought I'd show something else I adapted from it.
CREATE OR REPLACE FUNCTION partial_and_and(list1 TEXT[], list2 TEXT[])
RETURNS BOOLEAN AS $$
SELECT EXISTS(
SELECT * FROM unnest(list1) x, unnest(list2) y
WHERE x % y
);
$$ LANGUAGE SQL IMMUTABLE;
Then create the operator:
CREATE OPERATOR &&% (
LEFTARG = TEXT[],
RIGHTARG = TEXT[],
PROCEDURE = partial_and_and,
COMMUTATOR = &&%
);
And finally, the query:
SELECT p.id, p.skills, array_agg(t_skill) AS inter_skills
FROM programmers AS p
CROSS JOIN LATERAL unnest(Array['ruby', 'java']) AS t_skill
WHERE p.skills &&% array[t_skill]
GROUP BY p.id, p.skills
ORDER BY array_length(inter_skills, 1) DESC;
This will output an error saying column 'inter_skills' does not exist (not sure why), but oh well point is the query seems to work. All credit goes to Edouard.

Why does usage of lower() changes the order of resultset?

I have a table where I store information about users. The table has the following structure:
CREATE TABLE PERSONS
(
ID NUMBER(20, 0) NOT NULL,
FIRSTNAME VARCHAR2(40),
LASTNAME VARCHAR2(40),
BIRTHDAY DATE,
CONSTRAINT PERSONEN_PK PRIMARY KEY
(ID)
ENABLE
);
After inserting some test data:
SET DEFINE OFF;
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('1','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('2','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('3','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values ('6','Carl','Carlchen',to_date('01.01.12','DD.MM.RR'));
I want to select all duplicates of a given user. Let's use "Max Mustermann" for example:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
This gives me a result like this:
id first last birthday
=================================
1 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
I want to do a case insensitive compare, so I change the query using lower (and trim) like this:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname;
Now surprise the order has changed!
id first last birthday
=================================
1 Max Mustermann 31.10.89
5 Max Mustermann 31.10.89
4 Max Mustermann 31.10.89
2 Max Mustermann 31.10.89
Why does the order change, just by using lower() (same result when using without trim())!? I can get a stable ordering by adding the id column to the ORDER BY. But shouldn't the lower() have no affect to the ordering?
Workaround by also using id column for ORDER BY:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname,p.id;
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE lower(trim(p.firstname)) = lower(trim('mAx '))
AND lower(trim(p.lastname)) = lower(trim(' musteRmann '))
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.lastname,p.firstname,p.id;

If the values to be ordered by are identical, then the DBMS is free to choose any order it feels correct (the same way it is free to choose any order if no order by is specified alltogether).
Because all values of the columns in the order by are identical the resulting order is not stable. The only way to get a stable order is to include a unique column as an additional order criteria for ties - exactly what you did when you added the id column.
Why does the order change, just by using lower()
From a technical point, I'd guess that applying the lower() changed the execution plan and therefor the access path to the data.
But again (just to make sure): ordering on identical values never guarantees a stable order!

There is no ordering without an order by clause. Sometimes it looks like there might be (group by fooled a lot of people in older releases`, but it's only coincidental, and must not be relied upon. In your case you're ordering by some columns, but you expect duplicates within that ordering to be further ordered implicitly, which won't happen - or at least cannot be relied on.
In this case Oracle probably happens to be retrieving the rows for your first query in the order you inserted them purely as a side effect of how it's reading data from the blocks, and the order by sorts them within that set without actually changing them (or quite likely it's skipping the order by step internally if it realises it's pointless; the explain plan would tell you that).
If you change the order the order the records are created:
...
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('5','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
Insert into PERSONS (ID,FIRSTNAME,LASTNAME,BIRTHDAY) values
('4','Max','Mustermann',to_date('31.10.89','DD.MM.RR'));
...
then the result 'order' changes too:
SELECT p.id,p.firstname,p.lastname,p.birthday
FROM persons p
WHERE p.firstname = 'Max'
AND p.lastname = 'Mustermann'
AND p.birthday = to_date('31.10.1989','dd.mm.yyyy')
ORDER BY p.firstname,p.lastname;
ID FIRSTNAME LASTNAME BIRTHDAY
---------- -------------------- -------------------- ---------
1 Max Mustermann 31-OCT-89
2 Max Mustermann 31-OCT-89
5 Max Mustermann 31-OCT-89
4 Max Mustermann 31-OCT-89
Once you have the function things are changing enough for that happy accident to go out of the window, even if the records are inserted in id order (which has no relevance to the DB internally). lower() isn't changing the ordering, you just aren't getting lucky any more.
You cannot expect or rely on an order unless you fully specify it in the order by clause.

PostgreSQL and word games

In a word game similar to Ruzzle or Letterpress, where users have to construct words out of a given set of letters:
I keep my dictionary in a simple SQL table:
create table good_words (
word varchar(16) primary key
);
Since the game duration is very short I do not want to check every entered word by calling a PHP script, which would look that word up in the good_words table.
Instead I'd like to download all possible words by one PHP script call before the round starts - since all letters are known.
My question is: if there is a nice SQLish way to find such words?
I.e. I could run a longer-taking script once to add a column to good_words table, which would have same letters as in the word columnt, but sorted alphabetically... But I still can't think of a way to match for it given a set of letters.
And doing the word matching inside of a PHP script (vs. inside the database) would probably take too long (because of bandwidth: would have to fetch every row from the database to the PHP script).
Any suggestions or insights please?
Using postgresql-8.4.13 with CentOS Linux 6.3.
UPDATE:
Other ideas I have:
Create a constantly running script (cronjob or daemon) which would prefill an SQL table with precompiled letters board and possible words - but still feels like a waste of bandwidth and CPU, I would prefer to solve this inside the database
Add integer columns a, b, ... , z and whenever I store a word into good_words, store the letter occurences there. I wonder if it is possible to create an insert trigger in Pl/PgSQL for that?

Nice question, I upvoted.
What you're up to is a list of all possible permutations of the given letters of a given length. As described in the PostgreSQL wiki, you can create a function and call it like this (matches highlighted letters in your screenshot):
SELECT * FROM permute('{E,R,O,M}'::text[]);
Now, to query the good_words use something like:
SELECT gw.word, gw.stamp
FROM good_words gw
JOIN permute('{E,R,O,M}'::text[]) s(w) ON gw.word=array_to_string(s.w, '');

This could be a start, except that it doesn't check if we have enough letters, only if he have the right letters.
SELECT word from
(select word,generate_series(0,length(word)) as s from good_words) as q
WHERE substring(word,s,1) IN ('t','h','e','l','e','t','t','e','r','s')
GROUP BY word
HAVING count(*)>=length(word);
http://sqlfiddle.com/#!1/2e3a2/3
EDIT:
This query select only the valid words though it seems a bit redundant. It's not perfect but certainly proves it can be done.
WITH words AS
(SELECT word, substring(word,s,1) as sub from
(select word,generate_series(1,length(word)) as s from good_words) as q
WHERE substring(word,s,1) IN ('t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'))
SELECT w.word FROM
(
SELECT word,words.sub,count(DISTINCT s) as cnt FROM
(SELECT s, substring(array_to_string(l, ''),s,1) as sub FROM
(SELECT l, generate_subscripts(l,1) as s FROM
(SELECT ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'] as l)
as q)
as q) as let JOIN
words ON let.sub=words.sub
GROUP BY words.word,words.sub) as let
JOIN
(select word,sub,count(*) as cnt from words
GROUP BY word, sub)
as w ON let.word=w.word AND let.sub=w.sub AND let.cnt>=w.cnt
GROUP BY w.word
HAVING sum(w.cnt)=length(w.word);
Fiddle with all possible 3+ letters words (485) for that image: http://sqlfiddle.com/#!1/2fc66/1
Fiddle with 699 words out of which 485 are correct: http://sqlfiddle.com/#!1/4f42e/1
Edit 2:
We can use array operators like so to get a list of words that contain the letters we want:
SELECT word as sub from
(select word,generate_series(1,length(word)) as s from good_words) as q
GROUP BY word
HAVING array_agg(substring(word,s,1)) <# ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'];
So we can use it to narrow down the list of words we need to check.
WITH words AS
(SELECT word, substring(word,s,1) as sub from
(select word,generate_series(1,length(word)) as s from
(
SELECT word from
(select word,generate_series(1,length(word)) as s from good_words) as q
GROUP BY word
HAVING array_agg(substring(word,s,1)) <# ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s']
)as q) as q)
SELECT DISTINCT w.word FROM
(
SELECT word,words.sub,count(DISTINCT s) as cnt FROM
(SELECT s, substring(array_to_string(l, ''),s,1) as sub FROM
(SELECT l, generate_subscripts(l,1) as s FROM
(SELECT ARRAY['t','e','s','e','r','e','r','o','r','e','m','a','s','d','s','s'] as l)
as q)
as q) as let JOIN
words ON let.sub=words.sub
GROUP BY words.word,words.sub) as let
JOIN
(select word,sub,count(*) as cnt from words
GROUP BY word, sub)
as w ON let.word=w.word AND let.sub=w.sub AND let.cnt>=w.cnt
GROUP BY w.word
HAVING sum(w.cnt)=length(w.word) ORDER BY w.word;
http://sqlfiddle.com/#!1/4f42e/44
We can use GIN indexes to work on arrays so we probably could create a table that would store the arrays of letters and make words point to it (act, cat and tact would all point to array [a,c,t]) so probably that would speed things up but that's up for testing.

Create a table that has entries (id, char), be n the number of characters you are querying for.
select id, count(char) AS count from chartable where (char = x or char = y or char = z ...) and count = n group by id;
OR (for partial matching)
select id, count(char) AS count from chartable where (char = x or char = y or char = z ...) group by id order by count;
The result of that query has all the word-id's that fit the specifications. Cache the result in a HashSet and simple do a lookup whenever a word is entered.

You can add the column with sorterd letters formatted like '%a%c%t%'. Then use query:
select * from table where 'abcttx' like sorted_letters
to find words that can be built from letters 'abcttx'. I don't know about performance, but simplicity probably can't be beaten :)

Here is a query that finds the answers that can be found by walking through adjacent fields.
with recursive
input as (select '{{"t","e","s","e"},{"r","e","r","o"},{"r","e","m","a"},{"s","d","s","s"}}'::text[] as inp),
dxdy as(select * from (values(-1,-1),(-1,0),(-1,1),(0,1),(0,-1),(1,-1),(1,0),(1,1)) as v(dx, dy)),
start_position as(select * from generate_series(1,4) x, generate_series(1,4) y),
work as(select x,y,inp[y][x] as word from start_position, input
union
select w.x + dx, w.y + dy, w.word || inp[w.y+dy][w.x+dx]
from dxdy cross join input cross join work w
inner join good_words gw on gw.word like w.word || '%'
)
select distinct word from work
where exists(select * from good_words gw where gw.word = work.word)
(other answers don't take this into account).
Sql fiddle link: http://sqlfiddle.com/#!1/013cc/14 (notice You need an index with varchar_pattern_ops for the query to be reasonably fast).

Does not work in 8.4. Probably 9.1+ only. SQL Fidlle
select word
from (
select unnest(string_to_array(word, null)) c, word from good_words
intersect all
select unnest(string_to_array('TESTREROREMASDSS', null)) c, word from good_words
) s
group by word
having
array_agg(c order by c) =
(select array_agg(c order by c) from unnest(string_to_array(word, null)) a(c))

My own solution is to create an insert trigger, which writes letter frequencies into an array column:
create table good_words (
word varchar(16) primary key,
letters integer[26]
);
create or replace function count_letters() returns trigger as $body$
declare
alphabet varchar[];
i integer;
begin
alphabet := regexp_split_to_array('abcdefghijklmnopqrstuvwxyz', '');
new.word := lower(new.word);
for i in 1 .. array_length(alphabet, 1)
loop
-- raise notice '%: %', i, alphabet[i];
new.letters[i] := length(new.word) - length(replace(new.word, alphabet[i], ''));
end loop;
return new;
end;
$body$ language plpgsql;
create trigger count_letters
before insert on good_words
for each row execute procedure count_letters();
Then I generate similar array for the random board string tesereroremasdss
and compare both arrays using the array contains operator #>
Any new ideas or improvements are always welcome!

Cursor size (number of results)

How I can know cursor size (numbers of results)?
c CURSOR IS SELECT foo FROM mytable WHERE name='ok';

In my understanding, a cursor is NOT the result. You can use a cursor to GET your results row by row, and at the end of this row by row operations, you know how many results you got.
To know how many records you will (possibly) get, you can use a
select count(*) from ... where ...
assuming you have an index on column name, you could also write:
select count(name) from foo where name = 'ok'

If you want to obtain the total number of results without issuing a separate count query, you can:
SELECT count(1) OVER (), ... FROM ... WHERE ...
The count will be unaffected by ORDER/LIMIT clauses.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse