Word frequencies from strings in Postgres? - postgresql

Is it possible to identify distinct words and a count for each, from fields containing text strings in Postgres?

Something like this?
SELECT some_pk,
regexp_split_to_table(some_column, '\s') as word
FROM some_table
Getting the distinct words is easy then:
SELECT DISTINCT word
FROM (
SELECT regexp_split_to_table(some_column, '\s') as word
FROM some_table
) t
or getting the count for each word:
SELECT word, count(*)
FROM (
SELECT regexp_split_to_table(some_column, '\s') as word
FROM some_table
) t
GROUP BY word
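If you want the most frequent words first, the same aggregate can simply be ordered by the count:
SELECT word, count(*) AS cnt
FROM (
SELECT regexp_split_to_table(some_column, '\s') as word
FROM some_table
) t
GROUP BY word
ORDER BY cnt DESC;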

You could also use the PostgreSQL text-searching functionality for this, for example:
SELECT * FROM ts_stat('SELECT to_tsvector(''hello dere hello hello ridiculous'')');
will yield:
  word   | ndoc | nentry
---------+------+--------
 ridicul |    1 |      1
 hello   |    1 |      3
 dere    |    1 |      1
(3 rows)
(PostgreSQL applies language-dependent stemming and stop-word removal, which may or may not be what you want. Both can be disabled by using the simple dictionary instead of the english one, as shown below.)
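For example, running the same sample sentence through the simple configuration keeps every word verbatim:
SELECT * FROM ts_stat('SELECT to_tsvector(''simple'', ''hello dere hello hello ridiculous'')');
yields something like:
    word    | ndoc | nentry
------------+------+--------
 ridiculous |    1 |      1
 hello      |    1 |      3
 dere       |    1 |      1
(3 rows)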
The nested SELECT statement can be any SELECT statement that yields a tsvector column, so you could substitute a function that applies to_tsvector to any number of text fields and concatenates them into a single tsvector, over any subset of your documents, for example:
SELECT * FROM ts_stat('SELECT to_tsvector(''english'',title) || to_tsvector(''english'',body) FROM my_documents WHERE id < 500') ORDER BY nentry DESC;
would yield the total word counts taken from the title and body fields of the first 500 documents, sorted by descending number of occurrences. For each word you also get the number of documents it occurs in (the ndoc column).
See the documentation for more details: http://www.postgresql.org/docs/current/static/textsearch.html

Note that the second argument of regexp_split_to_table is a POSIX regular expression, so '\s' matches any single whitespace character, not the letter s. Still, splitting on '\s+' is safer: it treats a run of consecutive whitespace as one delimiter instead of producing empty-string "words":
SELECT word, count(*)
FROM (
SELECT regexp_split_to_table(some_column, '\s+') as word
FROM some_table
) t
GROUP BY word

Related

How to find posts tagged with any of the predefined tags in Postgresql

I have posts table with the following structure:
| id | score | title | tags |
-------------------------------------------------
| 1 | 42 | Travel | <uk><travel><passport> |
For each blog post I want to find relevant posts, tagged with any of the tags corresponding to the current page, in my case: <uk>, <travel> or <passport>. Then, order results by score, limit it to 5 items and display it to the user.
This is the code I came up with so far, but it seems to return matches only for the first tag in the query, <uk>.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
(tags ~ tag)::int as match_found
) m
where m.match_found > 0
) t
order by t.score desc
limit 5;
EDIT
After @Mike Organek's comment I changed the query to this, and it's working as I initially expected.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
position(tag in tags) > 0 as match_found
) m
where m.match_found and tag <> ''
) t
order by t.score desc
limit 5;
I would convert the tags into an array, then use the array "overlaps" operator && (has any element in common) to find the relevant posts:
select id, title, score, tags
from posts
where string_to_array(trim(both '<>' from replace(tags, '><', ',')), ',') && array['uk', 'travel', 'passport']
order by score desc
limit 5
In the long run, storing the tags as an array or a jsonb array is probably a lot more efficient.
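For example, a one-time migration could look like the following sketch (the tags_arr column name is made up for illustration):
-- add an array column and fill it from the existing text column
alter table posts add column tags_arr text[];
update posts
set tags_arr = string_to_array(trim(both '<>' from replace(tags, '><', ',')), ',');
-- a GIN index makes the array operators fast
create index on posts using gin (tags_arr);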
If you keep the text column and convert it on the fly a lot, things might get a bit easier if you create a function for the conversion:
create function tags_array(p_input text)
returns text[]
as
$$
select string_to_array(trim(both '<>' from replace(p_input, '><', ',')), ',');
$$
language sql
immutable;
Then the query is a bit easier to read:
select id, title, score, tags
from posts
where tags_array(tags) && array['uk', 'travel', 'passport']
order by score desc
limit 5
You can even create an index for that if you want:
create index on posts using gin ( (tags_array(tags)) );
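Note that this is an expression index: a query can only use it if it references the exact same expression, for example:
select id, title, score
from posts
where tags_array(tags) && array['uk'];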

Split and sequentially join string parts in Postgresql

I need to create a DB view with all sequential prefix combinations of the string parts of a source column. Example:
IN:
tag
--------
A_B_C_D
X_Y_Z
OUT:
subtag
--------
A
A_B
A_B_C
A_B_C_D
X
X_Y
X_Y_Z
The answer seems to be somewhere around WITH RECURSIVE, but I cannot put it all together.
demo:db<>fiddle
SELECT
array_to_string( -- 3
array_agg(t.value) OVER (PARTITION BY tags ORDER BY t.number), --2
'_'
) AS subtag
FROM
tags,
regexp_split_to_table(tag, '_') WITH ORDINALITY as t(value, number) -- 1
1. Split the string into one row per element. WITH ORDINALITY adds a row count which preserves the original order of the elements; for the A_B_C_D row this yields (A,1), (B,2), (C,3), (D,4).
2. Use the array_agg() window function to aggregate the elements. The ORDER BY makes the aggregation cumulative: {A}, {A,B}, {A,B,C}, {A,B,C,D}.
3. Reaggregate each array into a string.
You can use a recursive query:
WITH RECURSIVE s AS (
SELECT tag FROM tag
UNION
SELECT regexp_replace(tag, '_[^_]*$', '') FROM s
)
SELECT * FROM s;
tag
---------
A_B_C_D
X_Y_Z
A_B_C
X_Y
A_B
X
A
(7 rows)
The idea is to successively cut off _* at the end. The recursion stops because UNION (as opposed to UNION ALL) discards duplicates: once a value contains no underscore, regexp_replace returns it unchanged and the repeated row is not added again.
Thanks a lot @laurenz-albe! I wanted an explicit recursion break condition, so that rows without an underscore are not reprocessed, and ended up with this:
WITH RECURSIVE s AS (
SELECT tag FROM tag
UNION
SELECT regexp_replace(tag, '_[^_]*$', '')
FROM s
WHERE tag LIKE '%\_%'
)
SELECT * FROM s;
db<>fiddle

Does a String Value Exist in a List of Strings | Redshift Query

I have some interesting data that I'm trying to query, but I cannot get the syntax right. I have a temporary table (temp_id), which I've filled with the id values I care about. In this example it is only two ids.
CREATE TEMPORARY TABLE temp_id (id bigint PRIMARY KEY);
INSERT INTO temp_id (id) VALUES ( 1 ), ( 2 );
I have another table in production (let's call it foo) which holds multiples of those ids in a single cell. The ids column looks like this (below), with the ids as a single string separated by "|":
ids
-----------
1|9|3|4|5
6|5|6|9|7
NULL
2|5|6|9|7
9|11|12|99
I want to evaluate each cell in foo.ids and see if any of the ids in it match the ones in my temp_id table.
Expected output
ids        | does_match
-----------+-----------
1|9|3|4|5  | true
6|5|6|9|7  | false
NULL       | false
2|5|6|9|7  | true
9|11|12|99 | false
So far I've come up with this, but I can't seem to return anything. Instead of trying to create a new does_match column, I tried to filter within the WHERE clause. The issue is that I cannot figure out how to compare each id value in my temp table against the string blob full of ids in foo.
SELECT
ids
FROM foo
WHERE ids = ANY(SELECT LISTAGG(id, ' | ') FROM temp_id)
Any suggestions would be helpful.
Cheers,
This would work, however I'm not sure about the performance:
SELECT
ids
FROM foo
JOIN temp_id
ON '|' || foo.ids || '|' LIKE '%|' || temp_id.id::varchar || '|%'
You wrap the ids list in a pair of additional separators, so you can always search for |id|, and this works for the first and the last number as well.
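A quick sanity check of the delimiter trick, with hypothetical literals (id 1 is present, id 11 is not):
select '|' || '1|9|3|4|5' || '|' like '%|' || '1' || '|%';  -- true
select '|' || '1|9|3|4|5' || '|' like '%|' || '11' || '|%'; -- false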
The following SQL (I know it's a bit of a hack) returns exactly what you expect as output. It's tested with your sample data; I don't know how it would behave on your real data, so try it and let me know:
with seq as ( -- a sequence CTE to emulate Postgres' unnest;
              -- assuming you have at most 10 ids per ids field, extend this list if needed
select 1 as i union all
select 2 union all
select 3 union all
select 4 union all
select 5 union all
select 6 union all
select 7 union all
select 8 union all
select 9 union all
select 10)
select distinct ids,
  case -- since I can't take max() of a boolean directly, I use 1s and 0s
       -- and convert the result back to boolean
    when max(case
               when t.id in (
                 select split_part(f.ids, '|', seq.i)
                 from seq
                 join foo f on seq.i <= regexp_count(f.ids, '[|]') + 1 -- [|]: the pipe must be bracketed, it is a regex metacharacter
                 where split_part(f.ids, '|', seq.i) != '' and foo.ids = f.ids)
               then 1
               else 0
             end) = 1
    then true
    else false
  end as does_match
from temp_id t, foo
group by 1
Please let me know if this works for you!

How to split a string in a smart way?

The string_to_array function splits strings without keeping quoted substrings together:
# select unnest(string_to_array('one, "two,three"', ','));
unnest
--------
one
"two
three"
(3 rows)
I would like to have a smarter function, like this:
# select unnest(smarter_string_to_array('one, "two,three"', ','));
unnest
--------
one
two,three
(2 rows)
Purpose: I know that the COPY command does this properly, but I need this feature internally.
I want to parse the text representation of rows of an existing table. Example:
# select * from dataset limit 2;
id | name | state
----+-----------------+--------
1 | Smith, Reginald | Canada
2 | Jones, Susan |
(2 rows)
# select dataset::text from dataset limit 2;
dataset
------------------------------
(1,"Smith, Reginald",Canada)
(2,"Jones, Susan","")
(2 rows)
I want to do it dynamically in a PL/pgSQL function for different tables. I cannot assume a constant number of columns in a table, nor a fixed format of the column values.
There is a nice method to transpose a whole table into a one-column table:
select (json_each_text(row_to_json(t))).value from dataset t;
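With the sample dataset above this yields one value per column per row, something like:
      value
-----------------
 1
 Smith, Reginald
 Canada
 2
 Jones, Susan

(6 rows)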
If the column id is unique then
select id, array_agg(value) arr from (
select row_number() over() rn, id, value from (
select id, (json_each_text(row_to_json(t))).value from dataset t
) alias
order by id, rn
) alias
group by id;
gives you exactly what you want. The additional query level with row_number() is necessary to keep the original order of the columns.
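For the sample table this returns something like:
 id |             arr
----+------------------------------
  1 | {1,"Smith, Reginald",Canada}
  2 | {2,"Jones, Susan",""}
(2 rows)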

PostgreSQL: Pivot a 2-column result set into a single-row table

Struggling with what I thought would be a straightforward operation...
EDIT: SQLFiddle available here: http://sqlfiddle.com/#!15/11711/1/0
Using PostgreSQL 9.4, pretend I have a query that returns this two-column set:
CATEGORY | TOTAL
all | 14
soccer | 5
baseball | 6
hockey | 3
However I'd prefer to pivot it into a single-row set:
ALL | SOCCER | BASEBALL | HOCKEY
14 | 5 | 6 | 3
In other words, I want all my "CATEGORY" values to become columns, with the corresponding "TOTAL" value to be placed in the first row under the appropriate column.
I've been trying to use CROSSTAB()... but as of now I'm getting the following error:
ERROR: a column definition list is required for functions returning "record"
For reference, here's what I'm trying to put as my SQL command:
SELECT * FROM crosstab(
$$
WITH "countTotal" AS (
SELECT text 'all' AS "sportType", COUNT(*) AS "total"
FROM log
WHERE type = 'SPORT_EVENT_CREATED'
GROUP BY "sportType"
),
"countBySportType" AS (
SELECT sport_type AS "sportType", COUNT(*) AS "total"
FROM log
WHERE type = 'SPORT_EVENT_CREATED'
GROUP BY "sportType"
)
SELECT * FROM "countTotal"
UNION
SELECT * FROM "countBySportType"
$$
)
I think you have to specify the names and types of the output columns. From the Postgres manual on tablefunc:
The crosstab function is declared to return setof record, so the
actual names and types of the output columns must be defined in the
FROM clause of the calling SELECT statement, for example:
SELECT * FROM crosstab('...') AS ct(row_name text, category_1 text, category_2 text);
You have to use crosstabN(text) to use it with a dynamic number of columns. The answers to PostgreSQL Crosstab Query have a whole lot of detail about crosstab queries.
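Applied to the query above, a minimal sketch could look like this (assuming the tablefunc extension is installed, and the log table from the question). crosstab expects row_name/category/value triples, so a constant row name is added, and the two-argument form pins the category order:
SELECT * FROM crosstab(
$$
SELECT 'totals'::text AS row_name, "sportType", "total"
FROM (
  SELECT text 'all' AS "sportType", COUNT(*) AS "total"
  FROM log
  WHERE type = 'SPORT_EVENT_CREATED'
  UNION ALL
  SELECT sport_type, COUNT(*)
  FROM log
  WHERE type = 'SPORT_EVENT_CREATED'
  GROUP BY sport_type
) s
$$,
$$ VALUES ('all'), ('soccer'), ('baseball'), ('hockey') $$
) AS ct(row_name text, "all" bigint, soccer bigint, baseball bigint, hockey bigint);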
One more related post: Dynamic alternative to pivot with CASE and GROUP BY.
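In that spirit, since the categories here are fixed, plain conditional aggregation avoids crosstab entirely. A minimal sketch using the FILTER clause (available from PostgreSQL 9.4, which the question uses):
SELECT COUNT(*) AS "all",
       COUNT(*) FILTER (WHERE sport_type = 'soccer')   AS soccer,
       COUNT(*) FILTER (WHERE sport_type = 'baseball') AS baseball,
       COUNT(*) FILTER (WHERE sport_type = 'hockey')   AS hockey
FROM log
WHERE type = 'SPORT_EVENT_CREATED';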