PostgreSQL - Can I get the inverse of distinct rows?

I have a table of contacts. Each contact has an associated website, and each website can have multiple contacts.
I ran a query to get one contact per website with SELECT DISTINCT ON (website). This works fine.
But I want to do something with the rest of the data, i.e. the rows not selected by SELECT DISTINCT ON (website). Is there an inverse command where I can find all records from websites that have NOT been processed?

Use EXCEPT. Here is an illustration; the ORDER BY is only there for clarity.
create temporary table the_table (i integer, tx text);
insert into the_table values
(1, 'one'),
(1, 'one more one'),
(1, 'yet another one'),
(2, 'two'),
(2, 'one more two'),
(2, 'yet another two'),
(3, 'three'),
(3, 'three alternative');
select * from the_table
EXCEPT
select distinct on (i) * from the_table
order by i;
i | tx
---+-------------------
1 | one more one
1 | yet another one
2 | yet another two
2 | one more two
3 | three alternative
(5 rows)
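One caveat with the query above: without an ORDER BY of its own, DISTINCT ON (i) keeps an arbitrary row per group, and EXCEPT compares whole rows, so exact duplicates of the kept row disappear along with it. A minimal sketch of an alternative ranks the rows explicitly. It assumes the row to keep is the first one by tx, which is an assumption for illustration and not something the question specifies:

```sql
-- Rank the rows inside each group of i; rn = 1 marks the row that a
-- matching "distinct on (i) ... order by i, tx" would keep.
-- Ordering by tx is an assumption made for this sketch.
select i, tx
from (
    select i,
           tx,
           row_number() over (partition by i order by tx) as rn
    from the_table
) ranked
where rn > 1        -- everything except the kept row of each group
order by i, tx;
```

Because the filter works on a row number rather than on row equality, duplicate rows are returned individually instead of being collapsed.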

Related

Best way to avoid duplicates in table?

I've been given a task that requires writing a script to mass-change items in a table (ProductArea):
ProductID int
SalesareaID int
One ProductID can only exist once in each SalesareaID so there can't be any duplicates in this table. But one ProductID can be sold in multiple SalesareaID.
So an example would look something like:
ProductID SalesareaID
1 1
1 2
1 3
2 2
3 1
Now, some areas have merged. So, if I try to run a straightforward UPDATE to fix this like:
UPDATE ProductArea SET SalesareaID = 4 where SalesareaID IN (2, 3)
it will find (1, 2) and change that to (1, 4). Then it will find (1, 3) and try to change that to (1, 4). But that row already exists, so it will fail with a "Cannot insert duplicate key..." error.
Is there a best/recommended way to tell my UPDATE to only update if the resulting (ProductID, SalesareaID) doesn't already exist?
This should work.
It uses a window function to delete the rows that would collide before running the update.
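declare @T table (prodID int, salesID int, primary key (prodID, salesID));
insert into @T values
(1, 1)
, (1, 2)
, (1, 3)
, (2, 2)
, (3, 1);
-- Number the rows of each product that sit in the areas being merged,
-- then keep only the first one; the rest would become duplicates.
with cte as
( select t.*
, row_number() over (partition by t.prodID order by t.salesID) as rn
from @T t
where t.salesID in (2, 3)
)
delete cte where rn > 1;
-- At most one row per product is left in areas 2 and 3, so this cannot collide.
update @T set salesID = 4 where salesID in (2, 3);
select * from @T;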
If you are creating a new merged region from existing regions then I think the easiest thing to do would be to treat the merge as two separate operations.
First you insert entries for the new area based on the existing areas.
INSERT INTO ProductArea (ProductID, SalesareaID)
SELECT DISTINCT ProductID, 4 FROM ProductArea
WHERE SalesareaID IN (2, 3)
Then you remove the entries for the existing areas.
DELETE FROM ProductArea WHERE SalesareaID IN (2, 3)
The SalesareaID of 4 would need to be replaced by the id of the new Salesarea. The 2 and 3 would also need to be replaced by the ids of the areas you are merging to create the new Salesarea.
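If some products might already be listed in the new area, the INSERT above would run into the same duplicate-key error. A sketch of a guarded variant, assuming SQL Server syntax and the same ids as above (4 for the new area, 2 and 3 for the merged ones; the aliases pa and tgt are made up here), skips those products:

```sql
-- Add each affected product to the new area once,
-- unless it is already listed there.
INSERT INTO ProductArea (ProductID, SalesareaID)
SELECT DISTINCT pa.ProductID, 4
FROM ProductArea AS pa
WHERE pa.SalesareaID IN (2, 3)
  AND NOT EXISTS (
        SELECT 1
        FROM ProductArea AS tgt
        WHERE tgt.ProductID = pa.ProductID
          AND tgt.SalesareaID = 4
      );

-- Remove the rows for the areas that have been merged away.
DELETE FROM ProductArea WHERE SalesareaID IN (2, 3);
```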

How to filter a query based on a jsonb data?

I'm not even sure whether this kind of query is possible in Postgres. At least, I'm stuck.
I have two tables: a product recommendation list, containing multiple products to be recommended to a particular customer; and a transaction table indicating the products bought by each customer along with the transaction details.
I'm trying to track the performance of my recommendations by plotting all the transactions that match a recommendation (both customer and product).
Below is my test case.
Kindly help.
create table if not exists productRec( --Product Recommendation list
task_id int,
customer_id int,
detail jsonb);
truncate productRec;
insert into productRec values (1, 2, '{"1":{"score":5, "name":"KitKat"},
"4":{"score":2, "name":"Yuppi"}
}'),
(1, 3, '{"1":{"score":3, "name":"Yuppi"},
"4":{"score":2, "name":"GoldenSnack"}
}'),
(1, 4, '{"1":{"score":3, "name":"Chickies"},
"4":{"score":2, "name":"Kitkat"}
}');
drop table if exists txn;
create table if not exists txn( --Transaction table
customer_id int,
item_id text,
txn_value numeric,
txn_date date);
truncate txn;
insert into txn values (1, 'Yuppi', 500, DATE '2001-01-01'), (2, 'Kitkat', 2000, DATE '2001-01-01'),
(3, 'Kitkat', 2000, DATE '2001-02-01'), (4, 'Chickies', 200, DATE '2001-09-01');
--> Query must plot:
--Transaction value vs date where the item_id is inside the recommendation for that customer
--ex: (2000, 2001-01-01), (200, 2001-09-01)
We can get each recommendation as its own row with jsonb_each. I don't know what to do with the keys so I just take the value (still jsonb) and then the name inside it (the ->> outputs text).
select
customer_id,
(jsonb_each(detail)).value->>'name' as name
from productrec
So now we have a list of customer_ids and item_ids they were recommended. Now we can just join this with the transactions.
select
txn.txn_value,
txn.txn_date
from txn
join (
select
customer_id,
(jsonb_each(detail)).value->>'name' as name
from productrec
) p ON (
txn.customer_id = p.customer_id AND
lower(txn.item_id) = lower(p.name)
);
In your example data you spelled Kitkat differently in the recommendation table for customer 2. I added lowercasing in the join condition to counter that but it might not be the right solution.
txn_value | txn_date
-----------+------------
2000 | 2001-01-01
200 | 2001-09-01
(2 rows)
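The same join can also be written with jsonb_each in the FROM clause, which some find easier to read than a set-returning function in the select list. This is only a sketch of that variant; the aliases t, r and rec are made up here, and the lowercasing carries the same caveat as above:

```sql
-- Expand every recommendation into one row per JSON key, then keep the
-- transactions whose item matches a recommended name for that customer.
select t.txn_value,
       t.txn_date
from txn t
join productrec r
  on r.customer_id = t.customer_id
cross join lateral jsonb_each(r.detail) as rec(key, value)
where lower(rec.value->>'name') = lower(t.item_id)
order by t.txn_date;
```

On the sample data this returns the same two rows as the query above.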

Postgres how can I merge 2 separate select queries into 1

I am using Postgres 9.4 and I would like to merge 2 separate queries into one statement. I have been looking at the post "How to merge these queries into 1 using subquery" but still can't figure out how to make it work. These 2 queries do work independently. Here they are:
#1: select * from votes v where v.user_id=32 and v.stream_id=130;
#2: select city,state,post,created_on,votes,id as Voted from streams
where latitudes >=28.0363 AND 28.9059>= latitudes order by votes desc limit 5 ;
I would like query #2 to be limited by 5, however I don't want query #1 to be included in that limit so that up to 6 rows could be returned in total. This works like a suggestion engine where query #1 has a main thread and query #2 gives up to 5 different suggestions however they are obviously located in a different table.
Having no model and data I simulated this problem with dummies of both in this SQL Fiddle.
CREATE TABLE votes
(
id smallint
, user_id smallint
);
CREATE TABLE streams
(
id smallint
, foo boolean
);
INSERT INTO votes
VALUES (1, 42), (2, 32), (3, 17), (4, 37), (5, 73), (6, 69), (7, 21), (8, 18), (9, 11), (10, 15), (11, 28);
INSERT INTO streams
VALUES (1, true), (2, true), (3, true), (4, true), (5, true), (6, true), (7, false), (8, false), (9, false), (10, false), (11, false);
SELECT
id
FROM
(SELECT id, 1 AS sort FROM votes WHERE user_id = 32) AS query_1
FULL JOIN (SELECT id FROM streams WHERE NOT foo LIMIT 5) AS query_2 USING (id)
ORDER BY
sort
LIMIT 6;
Also, I have to point out that this isn't entirely my own work but an adaptation of an answer I came across the other day. Maybe that approach fits here too.
So, what's going on? Column id stands for any column your tables and sub-queries have in common. I added votes.user_id to have something to select on in one sub-query, and streams.foo for the other.
As you asked for 6 rows at most, I used the LIMIT clause twice: first in the sub-query, in case the table holds a huge number of rows you don't want to select, and again in the outer query to finally restrict the number of rows. Fiddle about a little with the two limits, and toggle WHERE foo and WHERE NOT foo, and you will see why.
In the first sub-query I added a sort column, as is done in that answer, because I guess you want the result of the first sub-query to always come out on top.
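A different way to get "one main row plus up to five suggestions" is UNION ALL with the limit applied inside the second branch only. This is a sketch against the dummy tables above and not the approach of the answer; the sort values 1 and 2 and the ORDER BY id inside the second branch are assumptions for illustration:

```sql
-- Branch 1: the main row, never subject to the limit.
(SELECT id, 1 AS sort FROM votes WHERE user_id = 32)
UNION ALL
-- Branch 2: at most five suggestions; the parentheses keep
-- ORDER BY and LIMIT local to this branch.
(SELECT id, 2 AS sort FROM streams WHERE NOT foo ORDER BY id LIMIT 5)
ORDER BY sort, id;
```

With the dummy data this yields id 2 first, followed by ids 7 to 11. Unlike the FULL JOIN, an id that appears in both branches comes back as two rows, and both branches must return the same number of columns with compatible types.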

PostgreSQL Get holes in index column

I suppose it is not easy to query a table for data that doesn't exist, but maybe there is some trick to find the holes in an integer column (rowindex).
Here is a small table illustrating the concrete situation:
DROP TABLE IF EXISTS examtable1;
CREATE TABLE examtable1
(rowindex integer primary key, mydate timestamp, num1 integer);
INSERT INTO examtable1 (rowindex, mydate, num1)
VALUES (1, '2015-03-09 07:12:45', 1),
(3, '2015-03-09 07:17:12', 4),
(5, '2015-03-09 07:22:43', 1),
(6, '2015-03-09 07:25:15', 3),
(7, '2015-03-09 07:41:46', 2),
(10, '2015-03-09 07:42:05', 1),
(11, '2015-03-09 07:45:16', 4),
(14, '2015-03-09 07:48:38', 5),
(15, '2015-03-09 08:15:44', 2);
SELECT rowindex FROM examtable1;
With the query shown I get all used indexes listed.
But I would like to get (say) the first five indexes that are missing, so I can use them to insert new data at the desired rowindex.
In this concrete example the result would be 2, 4, 8, 9, 12, which are the indexes that are not used.
Is there any trick to build a query that returns the first n missing indexes?
In reality, such a table may contain many rows, and the "holes" can be anywhere.
You can do this by generating a list of all numbers using generate_series() and then checking which numbers don't exist in your table.
This can either be done using an outer join:
select nr.i as missing_index
from (
select i
from generate_series(1, (select max(rowindex) from examtable1)) i
) nr
left join examtable1 t1 on nr.i = t1.rowindex
where t1.rowindex is null;
or a NOT EXISTS query:
select i
from generate_series(1, (select max(rowindex) from examtable1)) i
where not exists (select 1
from examtable1 t1
where t1.rowindex = i.i);
I have used a hardcoded lower bound for generate_series() so that you would also detect a missing rowindex that is smaller than the lowest number.
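To return only the first n missing values, as the question asks, the NOT EXISTS form can be ordered and limited. The sketch below assumes n = 5 and extends the series n past the current maximum, so that n values come back even when the table has fewer than n holes; the alias s(i) is made up here:

```sql
-- Candidates run from 1 to max(rowindex) + 5; coalesce covers an empty table.
select s.i as missing_index
from generate_series(
         1,
         (select coalesce(max(rowindex), 0) + 5 from examtable1)
     ) as s(i)
where not exists (select 1
                  from examtable1 t1
                  where t1.rowindex = s.i)
order by s.i
limit 5;
```

With the sample table this returns 2, 4, 8, 9 and 12.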

Check SQL Server table values against themselves

Imagine I had this table:
declare @tmpResults table ( intItemId int, strTitle nvarchar(100), intWeight float )
insert into @tmpResults values (1, 'Item One', 7)
insert into @tmpResults values (2, 'Item One v1', 6)
insert into @tmpResults values (3, 'Item Two', 6)
insert into @tmpResults values (4, 'Item Two v1', 7)
And a function, which we'll call fn_Lev, that takes two strings, compares them to one another and returns the number of differences between them as an integer (i.e. the Levenshtein distance).
What's the most efficient way to query that table, check the fn_Lev value of each strTitle against all the other strTitles in the table, and delete rows that are similar to one another by a Levenshtein distance of 3, preferring to keep the higher intWeights?
So after the delete, @tmpResults should contain
1 Item One 7
4 Item Two v1 7
I can think of ways to do this, but nothing that isn't horribly slow (i.e. iterative). I'm sure there's a faster way?
Cheers,
Matt
SELECT strvalue = CASE
WHEN t1.intweight >= t2.intweight THEN t1.strtitle
ELSE t2.strtitle
END,
dist = Fn_lev(t1.strtitle, t2.strtitle)
FROM @tmpResults AS t1
INNER JOIN @tmpResults AS t2
ON t1.intitemid < t2.intitemid
WHERE Fn_lev(t1.strtitle, t2.strtitle) = 3
This performs a self join that matches each pair of rows only once. It excludes matching a row with itself and the reverse of a previous match, i.e. if A<->B is a match then B<->A is not returned.
The CASE expression selects the higher-weighted title of each pair.
If I've understood you correctly, you can use a cross join:
SELECT t1.intItemId AS Id1, t2.intItemId AS Id2, fn_Lev(t1.strTitle, t2.strTitle) AS Lev
FROM @tmpResults AS t1
CROSS JOIN @tmpResults AS t2
The cross join gives you every combination of rows between the left and right side of the join (hence it doesn't need an ON clause, as it matches everything to everything else). You can then use the result of the SELECT to choose which rows to delete.
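Neither query above performs the delete itself. One possible way to turn the pairwise comparison into a single set-based DELETE is sketched below. It assumes the function can be called as dbo.fn_Lev (the dbo schema is a guess), treats a distance of 3 or less as "similar", and breaks weight ties by keeping the lower intItemId; none of these choices come from the question. It has to run in the same batch as the table variable declaration.

```sql
-- Delete a row when a different, similar row outranks it:
-- higher weight wins, and on equal weight the lower id wins.
-- dbo.fn_Lev and the "<= 3" threshold are assumptions.
DELETE t1
FROM @tmpResults AS t1
WHERE EXISTS (
    SELECT 1
    FROM @tmpResults AS t2
    WHERE t2.intItemId <> t1.intItemId
      AND dbo.fn_Lev(t1.strTitle, t2.strTitle) <= 3
      AND (t2.intWeight > t1.intWeight
           OR (t2.intWeight = t1.intWeight
               AND t2.intItemId < t1.intItemId))
);
```

Every row is compared against the original set, so a row can be removed on account of a neighbour that is itself removed. For long chains of similar titles a different rule may be needed. The function is still evaluated for every pair of rows, so this removes the iteration but not the quadratic number of comparisons.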