Postgres frequency count across arrays - postgresql

I have a column of text[]. How do I get a frequency count of all the objects across the column?
Example:
col_a
--------
{a, b}
{a}
{b}
{a}
Output should be:
col_a | count
----------------
a | 3
b | 2
My query:
with all_tags as (
select array_agg(c)
from (
select unnest(tags)
from message_tags
) as dt(c)
)
select count(*) from all_tags;

figured it out:
-- Collapse all tags into one array
with all_tags as (
select array_agg(c) as arr
from (
select unnest(ner_tags)
from message_tags
) as dt(c)
),
-- Expand single array into a row per tag
row_tags as (
select unnest(arr) as tags from all_tags
)
-- count distinct tags
select tags, count(*) from row_tags group by tags

As an alternative, you could just skip several steps and directly group on the unnested value:
select unnest(ner_tags) as tags,
count(*) as cnt
from message_tags
group by tags
order by cnt desc
Since you only require a count over each of the values (no distinct or other aggregates), this is the simplest solution.

Related

How to find posts tagged with any of the predefined tags in Postgresql

I have posts table with the following structure:
| id | score | title | tags |
-------------------------------------------------
| 1 | 42 | Travel | <uk><travel><passport> |
For each blog post I want to find relevant posts, tagged with any of the tags corresponding to the current page, in my case: <uk>, <travel> or <passport>. Then, order results by score, limit it to 5 items and display it to the user.
This is the code I came up with so far, but it seems only getting the result for the first tag in the query – <uk>.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
(tags ~ tag)::int as match_found
) m
where m.match_found > 0
) t
order by t.score desc
limit 5;
EDIT
After #Mike Organek's comment I changed the query this, and it's working as I initially expected.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
position(tag in tags) > 0 as match_found
) m
where m.match_found and tag <> ''
) t
order by t.score desc
limit 5;
I would convert the tags into an array then use array operators to find the relevant posts:
select id, title, score, tags
from posts
where string_to_array(trim(both '<>' from replace(tags, '><', ',')), ',') #> array['uk', 'travel', 'passport']
order by score
limit 5
In the long run, storing the tags as an array or a jsonb array is probably a lot more efficient.
If you do that a lot, things might get a bit easier if you create a function for this:
create function tags_array(p_input text)
returns text[]
as
$$
select string_to_array(trim(both '<>' from replace(p_input, '><', ',')), ',');
$$
language sql
immutable;
Then the query is a bit easier to read:
select id, title, score, tags
from posts
where tags_array(tags) #> array['uk', 'travel', 'passport']
order by score
limit 5
You can even create an index for that if you want:
create index on posts using gin ( (tags_array(tags)) );

Postgres count total matches per group

Input data
I have the following association table:
AssociationTable
- Item ID: Integer
- Tag ID: Integer
Referring to the following example data
Item Tag
1 1
1 2
1 3
2 1
and some input list of tags T (e.g. [1, 2])
What I want
For each item, I would like to know which tags were not provided in the input list T.
With our sample data, we'd get:
Item Num missing
1 1
2 0
My thoughts
The best I've done so far is: select "ItemId", count("TagId") as "Num missing" from "AssociationTab" where "TagId" not in (1) group by "ItemId";
The problem here is that items where all tags match will not be included in the output.
You could use a calendar table with anti-join approach:
WITH cte AS (
SELECT t1.Item, t2.Tag
FROM (SELECT DISTINCT Item FROM AssociationTable) t1
CROSS JOIN (SELECT 1 AS Tag UNION ALL SELECT 2) t2
)
SELECT
t1.Item,
COUNT(*) FILTER (WHERE t2.Item IS NULL) AS num_missing
FROM cte t1
LEFT JOIN AssociationTable t2
ON t1.Item = t2.Item AND
t1.Tag = t2.Tag AND
t2.Tag IN (1, 2)
GROUP BY
t1.Item;
Demo
The strategy here is to build a calendar/reference table in the first CTE which contains all combinations of items and tags. Then, we left join this CTE to your association table, aggregate by item, and then detect how many tags are missing for each item.
Simplest solution is
SELECT
ItemId,
count(*) FILTER (WHERE TagId NOT IN (1,2))
FROM AssociationTab
GROUP BY ItemId
Alternatively, if you already have an Items table with the item list, you could do this:
SELECT
i.ItemId,
count(a.TagId)
FROM Items i
LEFT JOIN AssociationTab a ON a.ItemId = i.ItemId AND a.TagId NOT IN (1,2)
GROUP BY i.ItemId
The key is that LEFT JOIN does not remove the Items row if no tags match.

Split and sequentially join string parts in Postgresql

I need to create a DB view with parts of sequential combinations of string parts of a source column. Example:
IN:
tag
--------
A_B_C_D
X_Y_Z
OUT:
subtag
--------
A
A_B
A_B_C
A_B_C_D
X
X_Y
X_Y_Z
The answer seems to be somewhere around WITH RECURSIVE, but I cannot put it all together.
demo:db<>fiddle
SELECT
array_to_string( -- 3
array_agg(t.value) OVER (PARTITION BY tags ORDER BY t.number), --2
'_'
) AS subtag
FROM
tags,
regexp_split_to_table(tag, '_') WITH ORDINALITY as t(value, number) -- 1
Split the string into one row per element. The WITH ORDINALITY adds a row count which can be used to hold the original order of the elements
Using array_agg() window function to aggregate the elements. The ORDER BY makes it cumulative
Reaggregate the array into a string.
You can use a recursive query:
WITH RECURSIVE s AS (
SELECT tag FROM tag
UNION
SELECT regexp_replace(tag, '_[^_]*$', '') FROM s
)
SELECT * FROM s;
tag
---------
A_B_C_D
X_Y_Z
A_B_C
X_Y
A_B
X
A
(7 rows)
The idea is to successively cut off _* at the end.
Thanks a lot #laurenz-albe! There is a problem with your code that it's missing recursion break condition. So I ended up with this:
WITH RECURSIVE s AS (
SELECT tag FROM tag
UNION
SELECT regexp_replace(tag, '_[^_]*$', '')
FROM s
WHERE tag LIKE '%\_%'
)
SELECT * FROM s;
db<>fiddle

How can I SUM distinct records in a Postgres database where there are duplicate records?

Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id" the second is "id" - which is the order ID and the third is "total" - which is the revenue.
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total), it's including the second entry in the database, even though the order ID is the same, which is causing my numbers to be larger than if I select distinct(id), total - export to excel and then sum the values manually.
So my question is - how can I SUM on just the distinct order IDs so that I get the same revenue as if I exported to excel every distinct order ID row?
Thanks in advance!
Easy - just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
See live demo.
Also handles any level of duplication, eg triplicates etc.
You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know what the reasons is for orders and totals to appear more than one time with different row_id. So using a common table expression (CTE) using the with ... phrase, we get the distinct id and total.
Under the CTE, we use this distinct data to do totaling. We join ID in the original table with the aggregation over distinct values. Then we comma-separate row_ids so that the information looks cleaner.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3
Create custom aggregate:
CREATE OR REPLACE FUNCTION sum_func (
double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE 'sql';
CREATE AGGREGATE dist_sum (
pg_catalog."any",
double precision)
(
SFUNC = sum_func,
STYPE = float8
);
And then calc distinct sum like:
select dist_sum(distinct id, total)
from orders
SQLFiddle
You can use DISTINCT in your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES
If we can trust that the total for 1 order is actually 1 row. We could eliminate the duplicates in a sub-query by selecting the the MAX of the PK id column. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
sql fiddle
In difficult cases:
select
id,
(
SELECT SUM(value::int4)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id
I would suggest just use a sub-Query:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The Above will give you the total of each id
Use below if you want the full total of each duplicate removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
select distinct id, total
from orders
)
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;

Retrieving the rows using join query

I have two tables like this
A B
---- -------
col1 col2 col1 col2
---------- -----------
A table contains 300k rows
B table contains 400k rows
I need to count the col1 for table A if it is matching col1 for table B
I have written a query like this:
select count(distinct ab.col1) from A ab join B bc on(ab.col1=bc.col1)
but this takes too much time
could try a group by...
Also ensure that the col1 is indexed in both tables
SELECT COUNT (col1 )
FROM
(
SELECT aa.col1
FROM A aa JOIN B bb on aa.col1 = bb.col1
GROUP BY (aa.col1)
)
It's difficult to answer without you positing more details: did you analyze the tables? Do you have an index on col1 on each table? How many rows are you counting?
That being said, there aren'y so many potential query plans for your query. You likely have two seq scans that are hash joined together, which is about the best you can do... If you've a material numbers of rows, you'll be counting a gazillion rows, and this takes time.
Perhaps you could rewrite the query differently? If every B.col1 is in A.col1, you could get the same result without the join:
select count(distinct col1) from B
If A has low cardinality, it might be faster to rely on exists():
with vals as (
select distinct A.col1 as val from A
)
select count(*) from vals
where exists(select 1 from B where B.col1 = vals.val)
Or, if you know every possible value from A.col1 and it's reasonably small, you could unnest an array without querying A at all:
select count(*) from unnest(Array[val1, val2, ...]) as vals (val)
where exists(select 1 from B where B.col1 = vals.val)
Or vice-versa, in each of the above, if every B holds the reference values.