Postgres array comparison - find missing elements

I have the table below.
╔════════════════════╦════════════════════╦═════════════╗
║id ║arr1 ║arr2 ║
╠════════════════════╬════════════════════╬═════════════╣
║1 ║{1,2,3,4} ║{2,1,7} ║
║2 ║{0} ║{3,4,5} ║
╚════════════════════╩════════════════════╩═════════════╝
I want to find out the elements which are in arr1 and not in arr2.
Expected output
╔════════════════════╦════════════════════╗
║id ║diff ║
╠════════════════════╬════════════════════╣
║1 ║{3,4} ║
║2 ║{0} ║
╚════════════════════╩════════════════════╝
If I have 2 individual arrays, I can do as follows:
select array_agg(elements)
from (
select unnest(array[0])
except
select unnest(array[3,4,5])
) t (elements)
But I am unable to integrate this code to work by selecting from my table.
Any help would be highly appreciated. Thank you!!

I would write a function for this:
create function array_diff(p_one int[], p_other int[])
returns int[]
as
$$
select array_agg(item)
from (
select *
from unnest(p_one) item
except
select *
from unnest(p_other)
) t
$$
language sql
stable;
Then you can use it like this:
select id, array_diff(arr1, arr2)
from the_table
A much faster alternative is to install the intarray module and use
select id, arr1 - arr2
from the_table
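To make the semantics of the EXCEPT-based array_diff function concrete, here is a small Python sketch that mimics it on the sample rows from the question. It is only an illustration, not part of the Postgres solution; the Python function name simply mirrors the SQL one, and the sort is an addition for readable output, since array_agg without an ORDER BY guarantees no particular element order.

```python
def array_diff(one, other):
    """Return the distinct elements of `one` that are absent from `other`.

    Like EXCEPT, this drops duplicates. The result is sorted only to make
    the output deterministic.
    """
    return sorted(set(one) - set(other))


# Sample rows from the question: (id, arr1, arr2)
rows = [(1, [1, 2, 3, 4], [2, 1, 7]), (2, [0], [3, 4, 5])]
result = {row_id: array_diff(arr1, arr2) for row_id, arr1, arr2 in rows}
print(result)  # {1: [3, 4], 2: [0]}
```

One difference worth knowing: when nothing is left over, the Python sketch returns an empty list, whereas array_agg over zero rows yields NULL.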

Use EXCEPT on the unnested (id, element) pairs, then group by id to aggregate each group back into an array:
with diff_data as (
select id, unnest(arr1) as data
from test_table
except
select id, unnest(arr2) as data
from test_table
)
select id, array_agg(data order by data) as diff
from diff_data
group by id

Related

How to find posts tagged with any of the predefined tags in Postgresql

I have posts table with the following structure:
| id | score | title | tags |
-------------------------------------------------
| 1 | 42 | Travel | <uk><travel><passport> |
For each blog post I want to find relevant posts, tagged with any of the tags corresponding to the current page, in my case: <uk>, <travel> or <passport>. Then, order results by score, limit it to 5 items and display it to the user.
This is the code I came up with so far, but it seems to return results only for the first tag in the query, <uk>.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
(tags ~ tag)::int as match_found
) m
where m.match_found > 0
) t
order by t.score desc
limit 5;
EDIT
After @Mike Organek's comment I changed the query to this, and it works as I initially expected.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
position(tag in tags) > 0 as match_found
) m
where m.match_found and tag <> ''
) t
order by t.score desc
limit 5;
I would convert the tags into an array, then use the array overlap operator && to find posts that share at least one tag with the current page (use @> instead if a post must carry all of the tags):
select id, title, score, tags
from posts
where string_to_array(trim(both '<>' from replace(tags, '><', ',')), ',') && array['uk', 'travel', 'passport']
order by score desc
limit 5
In the long run, storing the tags as an array or a jsonb array is probably a lot more efficient.
If you do that a lot, things might get a bit easier if you create a function for this:
create function tags_array(p_input text)
returns text[]
as
$$
select string_to_array(trim(both '<>' from replace(p_input, '><', ',')), ',');
$$
language sql
immutable;
Then the query is a bit easier to read:
select id, title, score, tags
from posts
where tags_array(tags) && array['uk', 'travel', 'passport']
order by score desc
limit 5
You can even create an index for that if you want:
create index on posts using gin ( (tags_array(tags)) );
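As a rough illustration of what the tag parsing and the "any tag in common" test do, here is a Python sketch. The function name mirrors the SQL tags_array; the second and third rows in posts are hypothetical data added for the example, and the set intersection stands in for an array overlap check.

```python
def tags_array(raw):
    """Turn '<uk><travel>' into ['uk', 'travel'], like the SQL function."""
    return raw.replace("><", ",").strip("<>").split(",")


# (id, score, title, tags); rows 2 and 3 are made up for illustration
posts = [
    (1, 42, "Travel", "<uk><travel><passport>"),
    (2, 10, "Cooking", "<food>"),
    (3, 77, "Visas", "<passport><visa>"),
]
wanted = {"uk", "travel", "passport"}

# Keep posts sharing at least one tag, highest score first, at most 5
relevant = sorted(
    (p for p in posts if wanted & set(tags_array(p[3]))),
    key=lambda p: p[1],
    reverse=True,
)[:5]
titles = [p[2] for p in relevant]
print(titles)  # ['Visas', 'Travel']
```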

"Subquery returns more than 1 value" when trying to INSERT

I'm trying to INSERT into a table a column that is part of another column in another table using TSQL, but I get the error stating that there is more than one value returned when I used that subquery as an expression. I understand what causes the error, but I can't seem to think of a way to make it produce what I want.
I'm trying to do something similar to:
A.Base    B.Reference    C.Wanted
------    -----------    --------
abcdaa    aa             abcdaa
bcdeab    bb             cdefbb
cdefbb    cc             efghcc
defgbc    ddd            fghddd
efghcc
fghddd
So I'm using the code:
INSERT INTO C ( [Some other column], Wanted )
SELECT
A.[Some other column]
, CASE
WHEN LEN( B.Reference ) = 2 THEN
( SELECT A.Base FROM A WHERE RIGHT( A.Base, 2 ) =
( SELECT B.Reference FROM B WHERE LEN( B.Reference ) = 2 )
)
WHEN LEN( B.Reference ) = 3 THEN
( SELECT A.Base FROM A WHERE RIGHT( A.Base, 3 ) =
( SELECT B.Reference FROM B WHERE LEN( B.Reference ) = 3 )
)
END
FROM
A
, B
Which will return me the "more than 1 value" error. Honestly, I'm probably making this way more convoluted than it needs to be, but I've been staring at these tables for a while now.
I hope I'm getting the idea across as to what I'm trying to do.
If you know the records aren't duplicate, and you are sure your JOIN between A and B works (as Martin mentioned) can't you just select distinct to return just the unique records?
I'd try it like this:
--Create a mockup with declared table variables and test data
DECLARE @tblA TABLE(someColumnInA VARCHAR(100));
DECLARE @tblB TABLE(someColumnInB VARCHAR(100));
DECLARE @tblC TABLE(someColumnInC VARCHAR(100));
INSERT INTO @tblA VALUES
('abcdaa')
,('bcdeab')
,('cdefbb')
,('defgbc')
,('efghcc')
,('fghddd');
INSERT INTO @tblB VALUES
('aa')
,('bb')
,('cc')
,('ddd');
--The query
INSERT INTO @tblC(someColumnInC)
SELECT a.someColumnInA
FROM @tblA a
WHERE EXISTS(SELECT 1 FROM @tblB b WHERE a.someColumnInA LIKE '%' + b.someColumnInB + '%');
SELECT * FROM @tblC;
The idea in short:
After creating a mockup (please do this in advance next time), we use a query to insert every value from @tblA into @tblC for which at least one value in @tblB occurs somewhere within it.
How about doing something like this?
select *
from A
where RIGHT(A.Base,2) IN (select B.Reference FROM B WHERE LEN(B.Reference) = 2)
UNION ALL
select *
from A
where RIGHT(A.Base,3) IN (select B.Reference FROM B WHERE LEN(B.Reference) = 3)
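The question asks for values of A.Base that end with a reference from B, while a LIKE pattern with a wildcard on both sides matches the reference anywhere in the string. A suffix-only variant can be sketched as follows. This is a hedged illustration that runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server, so the concatenation operator is || rather than the + used in T-SQL, and the single-column tables are simplified versions of A, B and C.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (Base TEXT);
    CREATE TABLE B (Reference TEXT);
    CREATE TABLE C (Wanted TEXT);
    INSERT INTO A VALUES ('abcdaa'), ('bcdeab'), ('cdefbb'),
                         ('defgbc'), ('efghcc'), ('fghddd');
    INSERT INTO B VALUES ('aa'), ('bb'), ('cc'), ('ddd');
""")

# Insert each Base once if it ends with any Reference (no trailing wildcard)
conn.execute("""
    INSERT INTO C (Wanted)
    SELECT a.Base
    FROM A AS a
    WHERE EXISTS (
        SELECT 1 FROM B AS b
        WHERE a.Base LIKE '%' || b.Reference
    )
""")

wanted = [row[0] for row in conn.execute("SELECT Wanted FROM C ORDER BY Wanted")]
print(wanted)  # ['abcdaa', 'cdefbb', 'efghcc', 'fghddd']
```

Because EXISTS only tests for a match, each row of A is inserted at most once, which sidesteps the "more than 1 value" problem of the scalar subqueries.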

Nesting and using LIKE and OR in postgres

I'm trying to nest this query, but I am getting the error: invalid input syntax for type boolean: "%malfunction%".
select *
from (
select column_one, column_two
from table
group by column_one, column_two
) as new_table
where column_two like '%false%' or '%malfunction%' or '%accidental%' or '%mistaken%'
order by column_one
Column_two is not boolean but it's identifying it as one.
I feel like I'm missing something small, but I can't find it. Help!
You can use any(array[...]), example:
with test (col) as (
values
('pear'), ('banana'), ('apple')
)
select *
from test
where col like any(array['%ea%', '%ba%']);
col
--------
pear
banana
(2 rows)
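The correct syntax needs a complete condition on each side of OR; a bare string such as '%malfunction%' is treated as a boolean operand, hence the error:
select col1, col2
from table_name
where condition1 OR condition2 OR condition3 ...;
In your case that means: where column_two like '%false%' or column_two like '%malfunction%' or ...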
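A minimal runnable sketch of the repeated-condition form, using the pear/banana/apple data from the earlier answer. It runs on SQLite via Python's sqlite3 module purely as a convenient stand-in for Postgres (an assumption of this example); the point is only that every OR operand carries its own LIKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (col TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?)", [("pear",), ("banana",), ("apple",)]
)

# Each side of OR is a full boolean expression: <column> LIKE <pattern>
matches = [
    row[0]
    for row in conn.execute(
        "SELECT col FROM test "
        "WHERE col LIKE '%ea%' OR col LIKE '%ba%' "
        "ORDER BY col"
    )
]
print(matches)  # ['banana', 'pear']
```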

How can I SUM distinct records in a Postgres database where there are duplicate records?

Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id" the second is "id" - which is the order ID and the third is "total" - which is the revenue.
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total), it's including the second entry in the database, even though the order ID is the same, which is causing my numbers to be larger than if I select distinct(id), total - export to excel and then sum the values manually.
So my question is - how can I SUM on just the distinct order IDs so that I get the same revenue as if I exported to excel every distinct order ID row?
Thanks in advance!
If every duplicate row of an order carries the same total, just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
This also handles any level of duplication, e.g. triplicates.
You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know why orders and totals appear more than once with different row_id values. So, using a common table expression (CTE) introduced by with ..., we first get the distinct id and total pairs.
Below the CTE, we use this distinct data to do the totaling. We join id in the original table with the aggregation over the distinct values. Then we aggregate the row_ids per order so that the information looks cleaner.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3
Create custom aggregate:
CREATE OR REPLACE FUNCTION sum_func (
double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE 'sql';
CREATE AGGREGATE dist_sum (
pg_catalog."any",
double precision)
(
SFUNC = sum_func,
STYPE = float8
);
And then calc distinct sum like:
select dist_sum(distinct id, total)
from orders
You can use DISTINCT in your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES
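One caveat, shown here as a hedged sketch: SUM(DISTINCT total) is safe per order when grouped by id, but if the GROUP BY is dropped to get a grand total, two different orders that happen to have the same total collapse into one. The example runs on SQLite via Python's sqlite3 module as a stand-in for Postgres, stores totals as integer cents to avoid rounding noise, and adds a hypothetical third order (id 4000) that is not in the original data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (row_id INTEGER, id INTEGER, total_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (6395, 1509, 11200), (22986, 1509, 11200),  # duplicated order 1509
        (1393, 3284, 4037), (24360, 3284, 4037),    # duplicated order 3284
        (30000, 4000, 11200),  # hypothetical order with the same total as 1509
    ],
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


naive = scalar("SELECT SUM(total_cents) FROM orders")
sum_distinct_total = scalar("SELECT SUM(DISTINCT total_cents) FROM orders")
per_order = scalar(
    "SELECT SUM(total_cents) "
    "FROM (SELECT DISTINCT id, total_cents FROM orders) AS t"
)
print(naive, sum_distinct_total, per_order)  # 41674 15237 26437
```

Only the last query, which removes duplicates on the (id, total) pair before summing, counts each order exactly once.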
If we can trust that the total for one order belongs on a single row, we could eliminate the duplicates in a subquery by selecting the MAX of the PK id column. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
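In difficult cases (this only helps when the duplicated rows share the same row_id, since jsonb keeps one value per key):
select
id,
(
SELECT SUM(value::numeric)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id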
I would suggest just using a subquery:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The above gives you the total for each id.
Use the query below if you want the grand total with duplicates removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
select distinct id, total
from orders
) as t
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;

T-SQL how to count the number of duplicate rows then print the outcome?

I have a table ProductNumberDuplicates_backups, which has two columns named ProductID and ProductNumber. There are some duplicate ProductNumbers. How can I count the distinct number of products, then print out the outcome like "() products was backup."? Because this is inside a stored procedure, I have to use a variable @numrecord to hold the distinct number of rows. My code looks like this:
set @numrecord= select distinct ProductNumber
from ProductNumberDuplicates_backups where COUNT(*) > 1
group by ProductID
having Count(ProductNumber)>1
Print cast(@numrecord as varchar)+' product(s) were backed up.'
Obviously the error is right after the = sign, since a select cannot follow it there. I've searched for similar cases but they are just select statements. Please help. Many thanks!
Try
select @numrecord= count(distinct ProductNumber)
from ProductNumberDuplicates_backups
Print cast(@numrecord as varchar)+' product(s) were backed up.'
begin tran
create table ProductNumberDuplicates_backups (
ProductNumber int
)
insert ProductNumberDuplicates_backups(ProductNumber)
select 1
union all
select 2
union all
select 1
union all
select 3
union all
select 2
select * from ProductNumberDuplicates_backups
declare @numRecord int
select @numRecord = count(ProductNumber) from
(select ProductNumber, ROW_NUMBER()
over (partition by ProductNumber order by ProductNumber) RowNumber
from ProductNumberDuplicates_backups) p
where p.RowNumber > 1
print cast(@numRecord as varchar) + ' product(s) were backed up.'
rollback
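The two answers count different things: the first counts distinct product numbers, the second counts the surplus rows beyond the first occurrence of each number. The sketch below shows both figures for the same test data. It is an illustration that runs on SQLite via Python's sqlite3 module rather than SQL Server (an assumption of this example), and it derives the surplus as total rows minus distinct values instead of using ROW_NUMBER(), which gives the same count when the column has no NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ProductNumberDuplicates_backups (ProductNumber INTEGER)")
conn.executemany(
    "INSERT INTO ProductNumberDuplicates_backups VALUES (?)",
    [(1,), (2,), (1,), (3,), (2,)],
)

# distinct_numbers: how many different product numbers exist
# surplus_rows: how many rows are repeats of an earlier product number
distinct_numbers, surplus_rows = conn.execute("""
    SELECT COUNT(DISTINCT ProductNumber),
           COUNT(*) - COUNT(DISTINCT ProductNumber)
    FROM ProductNumberDuplicates_backups
""").fetchone()

message = f"{distinct_numbers} product(s) were backed up."
print(message)       # 3 product(s) were backed up.
print(surplus_rows)  # 2
```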