postgresql find similar word groups

I have a table1 containing a column A, where ~100,000 strings (varchar) are stored. Unfortunately, each string has multiple words which are separated by spaces. They also differ in length, i.e. one string can consist of 3 words while another contains 7 words.
Then I have a column B stored in a second table2 which contains only 100 strings of the same kind, i.e. multiple words per string, separated by spaces.
The goal is to find out how likely a record of column B is to match (possibly multiple) records of column A, based on the words. The result should also include a ranking. I was thinking of using full text search in a loop, but I don't know how to do this. Is there a proper way to achieve this?

I don't know if you can "turn" a table into a dictionary to use full text search for ranking here. But you can query it with some primitive ranking quite easily, e.g.:
t=# with a(a) as (values('a b c'),('a c d'),('b e f'),('r b t'),('q w'))
, b(i,b) as (values(1,'a b'), (2,'e'), (3,'b'))
, p as (select unnest(string_to_array(b.b,' ')) arr,i from b)
select a phrases,arr match_words,count(1) over (partition by arr) words_in_matches, count(1) over (partition by i) matches,i from a left join p on a.a like '%'||arr||'%';
 phrases | match_words | words_in_matches | matches | i
---------+-------------+------------------+---------+---
 r b t   | b           |                6 |       5 | 1
 a b c   | b           |                6 |       5 | 1
 b e f   | b           |                6 |       5 | 1
 a b c   | a           |                2 |       5 | 1
 a c d   | a           |                2 |       5 | 1
 b e f   | e           |                1 |       1 | 2
 r b t   | b           |                6 |       3 | 3
 a b c   | b           |                6 |       3 | 3
 b e f   | b           |                6 |       3 | 3
 q w     |             |                1 |       1 |
(10 rows)
phrases are rows from your big table.
match_words are tokens from your small table (split on spaces).
words_in_matches is the number of result rows in which that token matched, counted over all phrases.
matches is the number of matches in big-table phrases for the small-table phrase.
i is the index of the phrase in the small table.
So you can order by the third or fourth column to get some sort of ranking.
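Note that `like '%'||arr||'%'` matches substrings, so a token such as `b` would also match a phrase containing `abc`. As a rough illustration of ranking on whole words instead, here is a minimal Python sketch, not the query above. The function name `rank_matches` and the scoring rule (number of shared words per pair of phrases) are assumptions made for this example.

```python
def rank_matches(big_phrases, small_phrases):
    """For each small-table phrase, rank big-table phrases by shared whole words."""
    ranking = {}
    for i, small in enumerate(small_phrases, start=1):
        small_words = set(small.split())
        scored = []
        for phrase in big_phrases:
            shared = small_words & set(phrase.split())
            if shared:
                scored.append((phrase, len(shared)))
        # Most shared words first; ties are broken alphabetically.
        scored.sort(key=lambda item: (-item[1], item[0]))
        ranking[i] = scored
    return ranking


# Same sample data as in the query above.
big = ["a b c", "a c d", "b e f", "r b t", "q w"]
small = ["a b", "e", "b"]

result = rank_matches(big, small)
for i, matches in result.items():
    print(i, matches)
```

With 100,000 x 100 pairs this loop is still small enough to run in memory, but it is only meant to show the idea of a per-pair score.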

Related

Aggregate all combinations of rows taken k at a time

I am trying to calculate an aggregate function for a field for a subset of rows in a table. The problem is that I'd like to find the mean of every combination of rows taken k at a time --- so for all the rows, I'd like to find (say) the mean of every combination of 10 rows. So:
id | count
----|------
1 | 5
2 | 3
3 | 6
...
30 | 16
should give me
mean of ids 1..10; ids 1, 3..11; ids 1, 4..12, and so on. I know this will yield a lot of rows.
There are SO answers for finding combinations from arrays. I could do this programmatically by taking 30 ids 10 at a time and then SELECTing them. Is there a way to do this with PARTITION BY, TABLESAMPLE, or another function (something like python's itertools.combinations())? (TABLESAMPLE by itself won't guarantee which subset of rows I am selecting as far as I can tell.)

The method described in the cited answer is static. A more convenient solution may be to use recursion.
Example data:
drop table if exists my_table;
create table my_table(id int primary key, number int);
insert into my_table values
(1, 5),
(2, 3),
(3, 6),
(4, 9),
(5, 2);
Query that finds the 2-element subsets of a 5-element set (k-combinations with k = 2):
with recursive recur as (
    select
        id,
        array[id] as combination,
        array[number] as numbers,
        number as sum
    from my_table
    union all
    select
        t.id,
        combination || t.id,
        numbers || t.number,
        sum + number
    from my_table t
    join recur r on r.id < t.id
    and cardinality(combination) < 2    -- param k
)
select combination, numbers, sum/2.0 as average    -- param k
from recur
where cardinality(combination) = 2;    -- param k
 combination | numbers | average
-------------+---------+--------------------
 {1,2}       | {5,3}   | 4.0000000000000000
 {1,3}       | {5,6}   | 5.5000000000000000
 {1,4}       | {5,9}   | 7.0000000000000000
 {1,5}       | {5,2}   | 3.5000000000000000
 {2,3}       | {3,6}   | 4.5000000000000000
 {2,4}       | {3,9}   | 6.0000000000000000
 {2,5}       | {3,2}   | 2.5000000000000000
 {3,4}       | {6,9}   | 7.5000000000000000
 {3,5}       | {6,2}   | 4.0000000000000000
 {4,5}       | {9,2}   | 5.5000000000000000
(10 rows)
The same query for k = 3 gives:
 combination | numbers | average
-------------+---------+--------------------
 {1,2,3}     | {5,3,6} | 4.6666666666666667
 {1,2,4}     | {5,3,9} | 5.6666666666666667
 {1,2,5}     | {5,3,2} | 3.3333333333333333
 {1,3,4}     | {5,6,9} | 6.6666666666666667
 {1,3,5}     | {5,6,2} | 4.3333333333333333
 {1,4,5}     | {5,9,2} | 5.3333333333333333
 {2,3,4}     | {3,6,9} | 6.0000000000000000
 {2,3,5}     | {3,6,2} | 3.6666666666666667
 {2,4,5}     | {3,9,2} | 4.6666666666666667
 {3,4,5}     | {6,9,2} | 5.6666666666666667
(10 rows)
Of course, you can remove numbers from the query if you do not need them.
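Since the question mentions Python's itertools.combinations(), a client-side cross-check of the k = 2 result could look like the sketch below. It is only an illustration on the five sample rows, not a replacement for the recursive query; the helper name `combination_means` is made up for this example.

```python
from itertools import combinations
from math import comb

# (id, number) pairs, same as the sample rows in my_table.
rows = [(1, 5), (2, 3), (3, 6), (4, 9), (5, 2)]


def combination_means(rows, k):
    """Return (ids, numbers, mean) for every k-element combination of rows."""
    results = []
    for combo in combinations(rows, k):
        ids = [r[0] for r in combo]
        numbers = [r[1] for r in combo]
        results.append((ids, numbers, sum(numbers) / k))
    return results


pairs = combination_means(rows, 2)
for ids, numbers, mean in pairs:
    print(ids, numbers, mean)

# Size of the result for the case in the question: 30 rows taken 10 at a time.
print(comb(30, 10))
```

The last line shows why the question expects "a lot of rows": 30 rows taken 10 at a time already give about 30 million combinations.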

Evaluate Values From Multiple Rows As Part of Aggregate or Window Function

I need to find a way to tell if a column has two specific values within a grouped/partitioned section. Easiest to describe by example. I have table "foo" with the following data:
ID | Indicator
1 | A
1 | B
1 | B
2 | C
2 | B
3 | A
3 | B
3 | B
3 | C
4 | A
4 | C
For my output I want a result of "A" if one of the rows in the group has Indicator "A". If not, then "C" if one of the rows has Indicator "C". But when the group has both an Indicator of "A" and an Indicator of "C", I want a result of "X" for the group. Given the data, I want the following result:
ID | Result
1 | A
2 | C
3 | X
4 | X
The result of A or C (IDs 1 and 2 in the example) can be obtained using a partition and a window function this way:
SELECT DISTINCT ID,
       priority_indicator
FROM (SELECT ID,
             first_value(Indicator) OVER
                 (PARTITION BY ID
                  ORDER BY
                      CASE
                          WHEN Indicator = 'A' THEN 1
                          WHEN Indicator = 'C' THEN 2
                          ELSE 3
                      END
                 ) priority_indicator
      FROM foo) a
How would you look at the values in multiple rows at once to return an "X" when there's both an "A" and a "C" in the Indicator?

--test data
WITH foo(id,indicator) AS ( VALUES
(1,'A'),
(1,'B'),
(1,'B'),
(2,'C'),
(2,'B'),
(3,'A'),
(3,'B'),
(3,'B'),
(3,'C'),
(4,'A'),
(4,'C')
),
-- get all entries for each id in indicator_set
agg AS (
    SELECT id, array_agg(DISTINCT indicator) AS indicator_set
    FROM foo
    GROUP BY id
)
-- actual query: @> is the array "contains" operator
SELECT id,
       CASE
           WHEN indicator_set @> '{A,C}' THEN 'X'
           WHEN indicator_set @> '{A}' THEN 'A'
           WHEN indicator_set @> '{C}' THEN 'C'
       END result
FROM agg;
Output:
 id | result
----+--------
  1 | A
  2 | C
  3 | X
  4 | X
(4 rows)
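The same result can probably also be reached without arrays, using conditional aggregation per group. The sketch below is an alternative to the array-based query above, not the answerer's method. It runs the SQL through Python's sqlite3 module only so that the example is self-contained; the SUM(CASE ...) form is plain SQL and should carry over to PostgreSQL, though that is an assumption that has not been tested here.

```python
import sqlite3

# Sample data from the question.
rows = [(1, "A"), (1, "B"), (1, "B"), (2, "C"), (2, "B"),
        (3, "A"), (3, "B"), (3, "B"), (3, "C"), (4, "A"), (4, "C")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER, indicator TEXT)")
conn.executemany("INSERT INTO foo VALUES (?, ?)", rows)

# Count the 'A' and 'C' rows per group, then decide on the result.
query = """
SELECT id,
       CASE
           WHEN SUM(CASE WHEN indicator = 'A' THEN 1 ELSE 0 END) > 0
            AND SUM(CASE WHEN indicator = 'C' THEN 1 ELSE 0 END) > 0 THEN 'X'
           WHEN SUM(CASE WHEN indicator = 'A' THEN 1 ELSE 0 END) > 0 THEN 'A'
           WHEN SUM(CASE WHEN indicator = 'C' THEN 1 ELSE 0 END) > 0 THEN 'C'
       END AS result
FROM foo
GROUP BY id
ORDER BY id
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```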

SQLite Manager: how to select columns in a table, keeping the count of the original occurrence of their values?

I am new to databases, and I am using the SQLite Manager Firefox add-on for a table like this:
rowid | col1 | col2 | col3
1 | N | Y | N
2 | N | N | N
3 | N | Y | Y
4 | N | Y | N
and I would like to reduce it to a table with a smaller number of columns (for instance 2) in the following way: each row should represent one possible combination of values (N,Y) of the 2 selected columns, associated with a count. This count should represent the number of rows in the original table where the other column assumed different values but the selected 2 columns have the given combination of values.
To be clear, if I select column 2 and 3, I would like to obtain:
col2 | col3 | count
Y | N | 2
N | N | 1
Y | Y | 1
while if I select columns 1 and 2:
col1 | col2 | count
N | Y | 3
N | N | 1
I have tried to use a combination of commands such as COUNT and GROUP BY, but without reaching my goal. For instance, I've tried to use:
SELECT *, COUNT (*) AS count FROM table_test
GROUP BY col2, col3
but it seems to work only for col2, giving me the # times col2=Y and the # times col2=N, but not in combination with col3...
Do you have any suggestion?
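No answer to this question is included here. One likely way to get the desired table is to select only the grouped columns together with COUNT(*), instead of SELECT *. The sketch below assumes a table named table_test holding exactly the four sample rows, and runs the query through Python's sqlite3 module; the helper `count_combinations` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_test (col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO table_test VALUES (?, ?, ?)",
    [("N", "Y", "N"), ("N", "N", "N"), ("N", "Y", "Y"), ("N", "Y", "N")],
)


def count_combinations(conn, first, second):
    """Count rows per combination of two columns (names come from trusted code)."""
    sql = (
        f"SELECT {first}, {second}, COUNT(*) AS count "
        f"FROM table_test GROUP BY {first}, {second}"
    )
    return conn.execute(sql).fetchall()


by_col2_col3 = count_combinations(conn, "col2", "col3")
by_col1_col2 = count_combinations(conn, "col1", "col2")
print(by_col2_col3)
print(by_col1_col2)
```

The column names are interpolated into the SQL text because identifiers cannot be bound as parameters, so they should never come from user input.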

sql query to break down count of every combination

I need a PostgreSQL query that returns the count of every combination of values in a record.
For example, I have a table T with columns A, B, C, D, E and other columns that are not of importance:
Table T
--------------
A | B | C | D | E
The query should return a table R with the values from columns A, B, C, D, and a count for how many times each configuration occurs with the specified E value.
Table R
---------------
A | B | C | D | count
When all of the counts for each record are added together, it should equal the total number of records in the original table.
It seems like a very simple problem, but due to my lack of SQL knowledge, I cannot figure out how to do this.
The only solution I can think of is this:
select a, b, c, d, count(*)
from T
where e = 'abc'
group by a, b, c, d
But when adding the counts up from this query, the total is way more than the count of the original table. It seems like count(*) shouldn't be used, or I'm just totally going about this the wrong way. I'd really appreciate any advice as to how I should go about this. Thank you all.

NULL values couldn't possibly fool you. Consider this demo:
WITH t(a,b,c,d) AS (
VALUES
(1,2,3,4)
,(1,2,3,NULL)
,(2,2,3,NULL)
,(2,2,3,NULL)
,(2,2,3,4)
,(2,NULL,NULL,NULL)
,(NULL,NULL,NULL,NULL)
)
SELECT a, b, c, d, count(*)
FROM t
GROUP BY a, b, c, d
ORDER BY a, b, c, d;
 a | b | c | d | count
---+---+---+---+-------
 1 | 2 | 3 | 4 |     1
 1 | 2 | 3 |   |     1
 2 | 2 | 3 | 4 |     1
 2 | 2 | 3 |   |     2
 2 |   |   |   |     1
   |   |   |   |     1
There must be some other misunderstanding here.

I figured it out, it was something really stupid. I forgot to specify the where 'E' = 'ABC' clause in the select count(*) when comparing the count. Thanks anyway for your help guys!
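The resolution above can be checked with a small sketch: the group counts only add up to the row count when both sides use the same WHERE filter. The table contents below are invented for illustration, and the SQL is run through Python's sqlite3 module rather than PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d INTEGER, e TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?)",
    [
        (1, 2, 3, 4, "abc"),
        (1, 2, 3, 4, "abc"),
        (1, 2, 3, None, "abc"),
        (2, 2, 3, 4, "xyz"),  # filtered out by e = 'abc'
        (2, 2, 3, 4, "abc"),
        (None, None, None, None, "abc"),
    ],
)

# Sum of the per-combination counts for e = 'abc'.
sum_of_counts = conn.execute(
    "SELECT SUM(cnt) FROM ("
    "  SELECT COUNT(*) AS cnt FROM t WHERE e = ? GROUP BY a, b, c, d"
    ")",
    ("abc",),
).fetchone()[0]

# The comparison must use the same filter; the unfiltered total differs.
filtered_total = conn.execute(
    "SELECT COUNT(*) FROM t WHERE e = ?", ("abc",)
).fetchone()[0]
unfiltered_total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

print(sum_of_counts, filtered_total, unfiltered_total)
```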

postgres counting one record twice if it meets certain criteria

I thought that the query below would naturally do what I explain, but apparently not...
My table looks like this:
id | name | g | partner | g2
1 | John | M | Sam | M
2 | Devon | M | Mike | M
3 | Kurt | M | Susan | F
4 | Stacy | F | Bob | M
5 | Rosa | F | Rita | F
I'm trying to get the id where either the g or g2 value equals 'M'... But, a record where both the g and g2 values are 'M' should return two lines, not 1.
So, in the above sample data, I'm trying to return:
$q = pg_query("SELECT id FROM mytable WHERE ( g = 'M' OR g2 = 'M' )");
1
1
2
2
3
4
But, it always returns:
1
2
3
4

Your query doesn't work because each row is returned only once whether it matches one or both of the conditions. To get what you want use two queries and use UNION ALL to combine the results:
SELECT id FROM mytable WHERE g = 'M'
UNION ALL
SELECT id FROM mytable WHERE g2 = 'M'
ORDER BY id
Result:
1
1
2
2
3
4
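To see why UNION ALL matters here, the sketch below loads the sample table into an in-memory SQLite database via Python's sqlite3 module and compares it with plain UNION, which removes duplicate rows. The table and column names follow the question; running it in SQLite instead of PostgreSQL is a choice made only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER, name TEXT, g TEXT, partner TEXT, g2 TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "M", "Sam", "M"),
        (2, "Devon", "M", "Mike", "M"),
        (3, "Kurt", "M", "Susan", "F"),
        (4, "Stacy", "F", "Bob", "M"),
        (5, "Rosa", "F", "Rita", "F"),
    ],
)

# UNION ALL keeps one row per matching condition.
union_all_ids = [row[0] for row in conn.execute(
    "SELECT id FROM mytable WHERE g = 'M' "
    "UNION ALL "
    "SELECT id FROM mytable WHERE g2 = 'M' "
    "ORDER BY id"
)]

# Plain UNION removes duplicates, so ids 1 and 2 appear only once.
union_ids = [row[0] for row in conn.execute(
    "SELECT id FROM mytable WHERE g = 'M' "
    "UNION "
    "SELECT id FROM mytable WHERE g2 = 'M' "
    "ORDER BY id"
)]

print(union_all_ids)
print(union_ids)
```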

You might try a UNION ALL along these lines (a plain UNION would remove the duplicate ids):
"SELECT id FROM mytable WHERE ( g = 'M') UNION ALL SELECT id FROM mytable WHERE ( g2 = 'M')"
Hope this helps, Martin

SELECT id FROM mytable WHERE g = 'M'
UNION ALL
SELECT id FROM mytable WHERE g2 = 'M'
