Remove duplicates based on condition and keep oldest element - postgresql

Currently, the table is ordered in ascending order by row_number. I need help removing duplicates based on two conditions.
If there is a row whose stage is online, then I want to keep that row (it doesn't matter which one; there can be multiple).
If there isn't a row with online for that org_id, then I keep the row with row_number = 1, which is the oldest element.
sales_id | org_id | stage     | row_number
-------- | ------ | --------- | ----------
ccc_123  | ccc    | off-line  | 1
ccc_123  | ccc    | off-line  | 2
ccc_123  | ccc    | online    | 3
abc_123  | abc    | off-line  | 1
abc_123  | abc    | power-off | 2
zzz_123  | zzz    | power-off | 1
So the table should look like this afterwards:
sales_id | org_id | stage
-------- | ------ | ----------
ccc_123  | ccc    | online
abc_123  | abc    | off-line
zzz_123  | zzz    | power-off

I would use a CASE expression to modify the row_num of records with stage = 'online', and then use ROW_NUMBER() to filter for the lowest value in each group.
http://sqlfiddle.com/#!17/1421b/5
create table sales_stage (
  sales_id varchar,
  org_id varchar,
  stage varchar,
  row_num int);
insert into sales_stage (sales_id, org_id, stage, row_num) values
('ccc_123', 'ccc', 'off-line', 1),
('ccc_123', 'ccc', 'off-line', 2),
('ccc_123', 'ccc', 'online', 3),
('abc_123', 'abc', 'off-line', 1),
('abc_123', 'abc', 'power-off', 2),
('zzz_123', 'zzz', 'power-off', 1);
SELECT sales_id, org_id, stage
FROM (
  SELECT sales_id, org_id, stage,
         ROW_NUMBER() OVER (PARTITION BY sales_id, org_id ORDER BY row_num) AS rn
  FROM (
    SELECT sales_id, org_id, stage,
           CASE WHEN stage = 'online' THEN -999 ELSE row_num END AS row_num
    FROM sales_stage
  ) x
) y
WHERE rn = 1
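A PostgreSQL-specific alternative (my sketch, not from the original answer, against the same sales_stage table) is DISTINCT ON. Booleans sort with true first under DESC, so ordering by (stage = 'online') DESC makes any online row win, falling back to the lowest row_num otherwise:
SELECT DISTINCT ON (sales_id, org_id)
       sales_id, org_id, stage
FROM sales_stage
ORDER BY sales_id, org_id,
         (stage = 'online') DESC, -- any online row sorts first
         row_num;                 -- otherwise the oldest row (row_num = 1)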


Subtracting values by id and a previous counter

Here is a snippet of my data:
customers | order_id | order_date | order_counter
--------- | -------- | ---------- | -------------
1         | a        | 1/1/2018   | 1
1         | b        | 1/4/2018   | 2
1         | c        | 3/8/2018   | 3
1         | d        | 4/9/2018   | 4
I'm trying to get the average number of days between orders for each customer. For the snippet above, the average should be 32.67 days: the gaps between consecutive orders are 3, 63, and 32 days, so we sum them and divide by 3.
My data has customers that may have more than 100 orders.
You could use the LAG function:
WITH cte AS (
  SELECT customers,
         order_date - LAG(order_date) OVER (PARTITION BY customers ORDER BY order_counter) AS d
  FROM t
)
SELECT customers, AVG(d)
FROM cte
WHERE d IS NOT NULL
GROUP BY customers;
db<>fiddle demo
With a self join, group by customer and get the average difference:
select t.customers,
       round(avg(tt.order_date - t.order_date), 2) averagedays
from tablename t
inner join tablename tt
  on tt.customers = t.customers
 and tt.order_counter = t.order_counter + 1
group by t.customers
See the demo.
Results:
| customers | averagedays |
| --------- | ----------- |
| 1         | 32.67       |
Please check the query below.
I inserted data for two customers so that we can verify that the average comes out correct for each customer.
DB Fiddle Example: https://www.db-fiddle.com/
CREATE TABLE test (
customers INTEGER,
order_id VARCHAR(1),
order_date DATE,
order_counter INTEGER
);
INSERT INTO test
(customers, order_id, order_date, order_counter)
VALUES
('1', 'a', '2018-01-01', '1'),
('1', 'b', '2018-01-04', '2'),
('1', 'c', '2018-03-08', '3'),
('1', 'd', '2018-04-09', '4'),
('2', 'a', '2018-01-01', '1'),
('2', 'b', '2018-01-06', '2'),
('2', 'c', '2018-03-12', '3'),
('2', 'd', '2018-04-15', '4');
commit;
select customers, round(avg(next_order_diff), 2) as average
from (
  select customers, order_date, next_order_date - order_date as next_order_diff
  from (
    select customers,
           lead(order_date) over (partition by customers order by order_date) as next_order_date,
           order_date
    from test
  ) a
  where next_order_date is not null
) a
group by customers
order by customers;
Another option. I myself like the answer from @forpas, except that it depends on order_counter increasing monotonically (what happens when an order is deleted?). The following accounts for that by actually counting the number of order pairs. It also accounts for customers who have placed only one order, returning NULL as the average.
select customers, round(sum(nd)::numeric / n, 2) avg_days_to_order
from (
  select customers,
         order_date - lag(order_date) over (partition by customers order by order_counter) nd,
         count(*) over (partition by customers) - 1 n
  from test
) d
group by customers, n
order by customers;
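As an aside (my addition, not part of either answer): because consecutive gaps telescope, the sum of the gaps is just the span from the first to the last order, so the same average can be computed without window functions. NULLIF guards the single-order case, again yielding NULL:
select customers,
       round((max(order_date) - min(order_date))::numeric
             / nullif(count(*) - 1, 0), 2) as average
from test
group by customers
order by customers;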

Populate random data from another table

update dataset1.test
set column4 = (select column1
               from dataset2
               order by random()
               limit 1
              )
I need to update column4 of dataset1 so that each row gets a random entry from column1 of dataset2. But with the query above, I get only one random entry repeated across all rows of dataset1, whereas I want each row to get its own random value.
SETUP
Let's start by assuming your tables and data are the following ones.
Note that I assume that dataset1 has a primary key (it can be a composite one, but, for the sake of simplicity, let's make it an integer):
CREATE TABLE dataset1
(
id INTEGER PRIMARY KEY,
column4 TEXT
) ;
CREATE TABLE dataset2
(
column1 TEXT
) ;
We fill both tables with sample data
INSERT INTO dataset1
(id, column4)
SELECT
i, 'column 4 for id ' || i
FROM
generate_series(101, 120) AS s(i);
INSERT INTO dataset2
(column1)
SELECT
'SOMETHING ' || i
FROM
generate_series (1001, 1020) AS s(i) ;
Sanity check:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
Case 1: number of rows in dataset1 <= rows in dataset2
We'll perform a complete shuffling. Values from dataset2 will be used once, and no more than once.
EXPLANATION
In order to make an update that shuffles all the values from column4 in a
random fashion, we need some intermediate steps.
First, for the dataset1, we need to create a list (relation) of tuples (id, rn), that
are just:
(id_1, 1),
(id_2, 2),
(id_3, 3),
...
(id_20, 20)
Where id_1, ..., id_20 are the ids present on dataset1.
They can be of any type, they need not be consecutive, and they can be composite.
For the dataset2, we need to create another list of (column_1,rn), that looks like:
(column1_1, 17),
(column1_2, 3),
(column1_3, 11),
...
(column1_20, 15)
In this case, the second column contains all the values 1 .. 20, but shuffled.
Once we have the two relations, we JOIN them ON ... rn. This, in practice, produces yet another list of tuples with (id, column1), where the pairing has been done randomly. We use these pairs to update dataset1.
THE REAL QUERY
This can all be done (clearly, I hope) by using some CTE (WITH statement) to hold the intermediate relations:
WITH original_keys AS
(
-- This creates tuples (id, rn),
-- where rn increases from 1 to number or rows
SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
)
, shuffled_data AS
(
-- This creates tuples (column1, rn)
-- where rn moves between 1 and number of rows, but is randomly shuffled
SELECT
column1,
-- The next statement is what *shuffles* all the data
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
)
-- You update your dataset1
-- with the shuffled data, linking back to the original keys
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
shuffled_data
JOIN original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
Note that the trick is performed by means of:
row_number() OVER (ORDER BY random()) AS rn
The row_number() window function produces as many consecutive numbers as there are rows, starting from 1.
These numbers are randomly shuffled because the OVER clause takes all the data and sorts it randomly.
CHECKS
We can check again:
SELECT count(DISTINCT column4) FROM dataset1 ;
| count |
| ----: |
| 20 |
SELECT * FROM dataset1 ;
id | column4
--: | :-------------
101 | SOMETHING 1016
102 | SOMETHING 1009
103 | SOMETHING 1003
...
118 | SOMETHING 1012
119 | SOMETHING 1017
120 | SOMETHING 1011
ALTERNATIVE
Note that this can also be done with subqueries, by simple substitution, instead of CTEs. That might improve performance on some occasions:
UPDATE
dataset1
SET
column4 = shuffled_data.column1
FROM
(SELECT
column1,
row_number() OVER (ORDER BY random()) AS rn
FROM
dataset2
) AS shuffled_data
JOIN
(SELECT
id,
row_number() OVER () AS rn
FROM
dataset1
) AS original_keys ON original_keys.rn = shuffled_data.rn
WHERE
dataset1.id = original_keys.id ;
And again...
SELECT * FROM dataset1;
id | column4
--: | :-------------
101 | SOMETHING 1011
102 | SOMETHING 1018
103 | SOMETHING 1007
...
118 | SOMETHING 1020
119 | SOMETHING 1002
120 | SOMETHING 1016
You can check the whole setup and experiment at dbfiddle here
NOTE: if you do this with very large datasets, don't expect it to be extremely fast. Shuffling a very big deck of cards is expensive.
Case 2: number of rows in dataset1 > rows in dataset2
In this case, values for column4 can be repeated several times.
The easiest possibility I can think of (probably, not an efficient one, but easy to understand) is to create a function random_column1, marked as VOLATILE:
CREATE FUNCTION random_column1()
RETURNS TEXT
VOLATILE -- important!
LANGUAGE SQL
AS
$$
SELECT
column1
FROM
dataset2
ORDER BY
random()
LIMIT
1 ;
$$ ;
And use it to update:
UPDATE
dataset1
SET
column4 = random_column1();
This way, some values from dataset2 might not be used at all, whereas others will be used more than once.
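If you repeat the earlier sanity check after this update (my note, mirroring the CHECKS section above), you will typically see fewer than 20 distinct values, since sampling with replacement usually produces repeats:
SELECT count(DISTINCT column4) FROM dataset1 ;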
dbfiddle here
It is better to reference the outer table from the subquery. Then the subquery has to be evaluated for every row:
update dataset1.test
set column4 = (select
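-- the always-true self-comparison below exists only to reference the
-- outer table, which forces PostgreSQL to re-evaluate the subquery for
-- each row; note that if column4 is NULL, the comparison (and hence the
-- CASE) yields NULL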
case when dataset1.test.column4 = dataset1.test.column4
then column1 end
from dataset2
order by random()
limit 1
)

Find all records NOT in any blocked range where blocked ranges are in a table

I have a table TaggedData with the following fields and data
ID GroupID Tag MyData
** ******* *** ******
1 Texas AA01 Peanut Butter
2 Texas AA15 Cereal
3 Ohio AA05 Potato Chips
4 Texas AA08 Bread
I have a second table of BlockedTags as follows:
ID StartTag EndTag
** ******** ******
1 AA00 AA04
2 AA15 AA15
How do I select from this to return all data matching a given GroupId but NOT in any blocked range (inclusive)? For the data given, if the GroupId is Texas, I don't want to return Cereal, because it matches the second range (and Peanut Butter falls in the first); it should only return Bread.
I did try LEFT JOIN-based queries, but I'm not even close.
Thanks
create table TaggedData (
ID int,
GroupID varchar(16),
Tag char(4),
MyData varchar(50))
create table BlockedTags (
ID int,
StartTag char(4),
EndTag char(4)
)
insert into TaggedData(ID, GroupID, Tag, MyData)
values (1, 'Texas', 'AA01', 'Peanut Butter')
insert into TaggedData(ID, GroupID, Tag, MyData)
values (2, 'Texas' , 'AA15', 'Cereal')
insert into TaggedData(ID, GroupID, Tag, MyData)
values (3, 'Ohio ', 'AA05', 'Potato Chips')
insert into TaggedData(ID, GroupID, Tag, MyData)
values (4, 'Texas', 'AA08', 'Bread')
insert into BlockedTags(ID, StartTag, EndTag)
values (1, 'AA00', 'AA04')
insert into BlockedTags(ID, StartTag, EndTag)
values (2, 'AA15', 'AA15')
select t.* from TaggedData t
left join BlockedTags b on t.Tag between b.StartTag and b.EndTag
where b.ID is null
Returns:
ID GroupID Tag MyData
----------- ---------------- ---- --------------------------------------------------
3 Ohio AA05 Potato Chips
4 Texas AA08 Bread
(2 row(s) affected)
So, to match on a given GroupID, you change the query like this:
select t.* from TaggedData t
left join BlockedTags b on t.Tag between b.StartTag and b.EndTag
where b.ID is null and t.GroupID = @GivenGroupID
I prefer NOT EXISTS, simply because it gives you more readability and usually better performance on large data (several cases get better execution plans).
It would look like this:
SELECT t.* FROM TaggedData t
WHERE t.GroupID = @GivenGroupID
AND NOT EXISTS (SELECT 1 FROM BlockedTags b
                WHERE t.Tag BETWEEN b.StartTag AND b.EndTag)
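For instance (my check, using the sample data above), with @GivenGroupID = 'Texas' both forms should return only Bread, since AA01 falls in the range AA00-AA04 and AA15 matches the single-tag range:
select t.* from TaggedData t
where t.GroupID = 'Texas'
  and not exists (select 1 from BlockedTags b
                  where t.Tag between b.StartTag and b.EndTag)
-- returns: 4  Texas  AA08  Bread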

Make a column values header for rest of columns using TSQL

I have following table
ID | Group    | Type   | Product
1  | Dairy    | Milk   | Fresh Milk
2  | Dairy    | Butter | Butter Cream
3  | Beverage | Coke   | Coca cola
4  | Beverage | Diet   | Dew
5  | Beverage | Juice  | Fresh Juice
I need following output/query result:
ID | Group    | Type   | Product
1  | Dairy    |        |
1  |          | Milk   | Fresh Milk
2  |          | Butter | Butter Cream
2  | Beverage |        |
1  |          | Coke   | Coca cola
2  |          | Diet   | Dew
3  |          | Juice  | Fresh Juice
For the sample above, a hard-coded script could do the job, but I'm looking for a dynamic script that works for any number of groups. I have no idea how it can be done, so I don't have a sample query yet. I need ideas or examples that at least point me in the right direction. PIVOT looks like a close option, but it doesn't seem to fully fit this case.
Here's a possible way. It basically unions the "Group-Headers" and the "Group-Items". The difficulty was to order them correctly.
WITH CTE AS
(
  SELECT ID, [Group], Type, Product,
         ROW_NUMBER() OVER (PARTITION BY [Group] ORDER BY ID) AS RN
  FROM Drink
)
SELECT ID, [Group], Type, Product
FROM (
  SELECT RN AS ID, [Group], [Id] AS OriginalId,
         '' AS Type, '' AS Product, 0 AS RN, 'Group' AS RowType
  FROM CTE
  WHERE RN = 1
  UNION ALL
  SELECT RN AS ID, '' AS [Group], [Id] AS OriginalId,
         Type, Product, RN, 'Item' AS RowType
  FROM CTE
) X
ORDER BY OriginalId ASC,
         CASE WHEN RowType = 'Group' THEN 0 ELSE 1 END ASC,
         RN ASC
Here's a demo-fiddle: http://sqlfiddle.com/#!6/ed6ca/2/0
A slightly simplified approach:
With Groups As
(
Select Distinct Min(Id) As Id, [Group], '' As [Type], '' As Product
From dbo.Source
Group By [Group]
)
Select Coalesce(Cast(Z.Id As varchar(10)),'') As Id
, Coalesce(Z.[Group],'') As [Group]
, Z.[Type], Z.Product
From (
Select Id As Sort, Id, [Group], [Type], Product
From Groups
Union All
Select G.Id, Null, Null, S.[Type], S.Product
From dbo.Source As S
Join Groups As G
On G.[Group] = S.[Group]
) As Z
Order By Sort
It should be noted that the use of Coalesce is purely for aesthetic reasons. You could simply return null in these cases.
SQL Fiddle
And an approach with ROW_NUMBER:
IF OBJECT_ID('dbo.grouprows') IS NOT NULL DROP TABLE dbo.grouprows;
CREATE TABLE dbo.grouprows(
ID INT,
Grp NVARCHAR(MAX),
Type NVARCHAR(MAX),
Product NVARCHAR(MAX)
);
INSERT INTO dbo.grouprows VALUES
(1,'Dairy','Milk','Fresh Milk'),
(2,'Dairy','Butter','Butter Cream'),
(3,'Beverage','Coke','Coca cola'),
(4,'Beverage','Diet','Dew'),
(5,'Beverage','Juice','Fresh Juice');
SELECT
CASE WHEN gg = 0 THEN dr1 END GrpId,
CASE WHEN gg = 1 THEN rn1 END TypeId,
ISNULL(Grp,'')Grp,
CASE WHEN gg = 1 THEN Type ELSE '' END Type,
CASE WHEN gg = 1 THEN Product ELSE '' END Product
FROM(
SELECT *,
DENSE_RANK()OVER(ORDER BY Grp DESC) dr1
FROM(
SELECT *,
ROW_NUMBER()OVER(PARTITION BY Grp ORDER BY type,gg) rn1,
ROW_NUMBER()OVER(ORDER BY type,gg) rn0
FROM(
SELECT Grp,Type,Product, GROUPING(Grp) gg, GROUPING(type) tg FROM dbo.grouprows
GROUP BY Product, Type, Grp
WITH ROLLUP
)X1
WHERE tg = 0
)X2
WHERE gg=1 OR rn1 = 1
)X3
ORDER BY rn0

Subtract the previous row of data where the id is the same as the row above

I have been trying all afternoon to achieve this with no success.
I have a database with info on customers and the dates they purchase products from the store. It is grouped by a batch ID, which I have converted into a date format.
So in my table I now have:
CustomerID | Date
1234       | 2011-10-18
1234       | 2011-10-22
1235       | 2011-11-16
1235       | 2011-11-17
What I want to achieve is to see the number of days between each purchase and the same customer's previous purchase, and so on.
For example:
CustomerID | Date       | Outcome
1234       | 2011-10-18 |
1234       | 2011-10-22 | 4
1235       | 2011-11-16 |
1235       | 2011-11-17 | 1
I have tried joining the table to itself, but the problem is that I end up with rows joined back to themselves. I then tried making the join return rows where the dates did not match (<>).
Hope this makes sense; any help appreciated. I have searched all the relevant topics on here.
Will there be multiple separate groups of rows per CustomerID, or will each customer's rows always be grouped together?
DECLARE #myTable TABLE
(
CustomerID INT,
Date DATETIME
)
INSERT INTO #myTable
SELECT 1234, '2011-10-14' UNION ALL
SELECT 1234, '2011-10-18' UNION ALL
SELECT 1234, '2011-10-22' UNION ALL
SELECT 1234, '2011-10-26' UNION ALL
SELECT 1235, '2011-11-16' UNION ALL
SELECT 1235, '2011-11-17' UNION ALL
SELECT 1235, '2011-11-18' UNION ALL
SELECT 1235, '2011-11-19'
SELECT CustomerID,
MIN(date),
MAX(date),
DATEDIFF(day,MIN(date),MAX(date)) Outcome
FROM #myTable
GROUP BY CustomerID
SELECT a.CustomerID,
a.[Date],
ISNULL(DATEDIFF(DAY, b.[Date], a.[Date]),0) Outcome
FROM
(
SELECT ROW_NUMBER() OVER(PARTITION BY [CustomerID] ORDER BY date) Row,
CustomerID,
Date
FROM #myTable
) A
LEFT JOIN
(
SELECT ROW_NUMBER() OVER(PARTITION BY [CustomerID] ORDER BY date) Row,
CustomerID,
Date
FROM #myTable
) B ON a.CustomerID = b.CustomerID AND A.Row = B.Row + 1
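A sketch of an alternative (my addition, assuming SQL Server 2012 or later, where LAG is available; the ROW_NUMBER self-join above also works on older versions) that avoids the join entirely:
SELECT CustomerID,
       [Date],
       -- DATEDIFF returns NULL for each customer's first row; ISNULL maps it to 0
       ISNULL(DATEDIFF(DAY,
                       LAG([Date]) OVER (PARTITION BY CustomerID ORDER BY [Date]),
                       [Date]), 0) AS Outcome
FROM #myTable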