Number of consecutive rows in grouping - T-SQL

I have a table Samples which contains samples of product prices. Note the ascending date order.
+----+------------+-------+--------+------------+
| Id | Product_Id | Price | Status | Date       |
+----+------------+-------+--------+------------+
|  1 |          1 |   400 |      0 | 1404656325 |
|  2 |          2 |   300 |      0 | 1404657325 |
|  3 |          3 |   100 |      0 | 1404658325 |
|  4 |          1 |   400 |      0 | 1404659325 |
|  5 |          2 |   300 |      0 | 1404660325 |
|  6 |          3 |   100 |      1 | 1404661325 |
|  7 |          1 |   500 |      1 | 1404662325 |
|  8 |          2 |   500 |      0 | 1404663325 |
|  9 |          3 |   500 |      1 | 1404664325 |
+----+------------+-------+--------+------------+
I am interested in grouping by Product_Id so that I have a list of unique product ids, each with its latest price (that is, the price at the greatest date).
This is essentially the classic greatest-n-per-group problem, but in addition I also want a column showing for how many consecutive rows, counting back from the latest date, the Status column has held the same value.
So considering my example table above, I should end up with:
+------------+-------+------------+
| Product_Id | Price | SameStatus |
+------------+-------+------------+
|          1 |   500 |          1 |
|          2 |   500 |          3 |
|          3 |   500 |          2 |
+------------+-------+------------+
I hope it is clear what I want to achieve and there is a friendly person willing to guide me.

This should work. The approach makes use of ROW_NUMBER():
;WITH Samples (Id, Product_Id, Price, [Status], [Date]) AS
(
    SELECT 1, 1, 400, 0, 1404656325 UNION ALL
    SELECT 2, 2, 300, 0, 1404657325 UNION ALL
    SELECT 3, 3, 100, 0, 1404658325 UNION ALL
    SELECT 4, 1, 400, 0, 1404659325 UNION ALL
    SELECT 5, 2, 300, 0, 1404660325 UNION ALL
    SELECT 6, 3, 100, 1, 1404661325 UNION ALL
    SELECT 7, 1, 500, 1, 1404662325 UNION ALL
    SELECT 8, 2, 500, 0, 1404663325 UNION ALL
    SELECT 9, 3, 500, 1, 1404664325
)
,NumberingLogic AS
(
    SELECT *
        -- numbers each product's rows per status value, oldest first
        ,SameStatus = ROW_NUMBER() OVER (PARTITION BY Product_Id, [Status] ORDER BY [Date])
        -- 1 marks the newest row per product
        ,LatestRow = ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY [Date] DESC)
    FROM Samples
)
SELECT Product_Id
    ,Price
    ,SameStatus
FROM NumberingLogic
WHERE LatestRow = 1
P.S. How your dates work is a little unclear to me (they look like Unix timestamps), but I have used them to order by.
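One caveat: because the first ROW_NUMBER() partitions by Product_Id and [Status], it counts all rows with the latest status, not just consecutive ones, so it can overcount when a product's status flips back and forth (e.g. 1, 0, 1). A gaps-and-islands sketch that only counts the latest unbroken run, reusing the Samples CTE above:

;WITH Numbered AS
(
    SELECT *
        -- the difference of the two row numbers is constant within a run of equal Status
        ,Grp = ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY [Date])
             - ROW_NUMBER() OVER (PARTITION BY Product_Id, [Status] ORDER BY [Date])
    FROM Samples
)
,Runs AS
(
    SELECT *
        -- position within the current run; the newest row carries the streak length
        ,RunLength = ROW_NUMBER() OVER (PARTITION BY Product_Id, [Status], Grp ORDER BY [Date])
        ,LatestRow = ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY [Date] DESC)
    FROM Numbered
)
SELECT Product_Id
    ,Price
    ,SameStatus = RunLength
FROM Runs
WHERE LatestRow = 1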

Related

How to calculate the customer rebooking rate using Postgres

I need to calculate the re-booking rate percentage by teacher for a given time range. My table structure is fairly standard for an e-commerce platform; the booking, customer and teacher tables are included below.
Re-booking rate: a customer can create a booking for any teacher, but each booking can only have one teacher at a time. A re-booking happens when a teacher is booked more than once by the same customer. The teacher with the highest percentage has more repeat bookings than any other teacher.
I wondered whether a temporary table needs to be created for each teacher with a list of bookings?
table: Booking
id, customer_id, teacher_id
table: Customer
id, email, forename, surname
table: Teacher
id, name, email
I have the following query, but I'm not sure how to calculate the re-booking rate:
SELECT t.id, t.email
FROM booking b
JOIN customer c on c.id = b.customer_id
JOIN teacher t on t.id = b.teacher_id
WHERE b.time_created = "..."
Sample dataset
Booking
id, customer_id, teacher_id
1, 1, 1
2, 2, 1
3, 6, 2
4, 8, 4
5, 4, 3
6, 1, 1
Customer
id, email, forename, surname
1, tom@test.com, tom, smith
2, rachel@test.com, rachel, green
3, jeff@lycos.com, jeff, price
4, max@google.com, max, cooper
5, tom@msn.com, tom, white
6, pete@gmail.com, pete, tinner
7, lenny@hotmail.com, lenny, allen
8, noel@gmail.com, noel, ashton
Teacher
id, name, email
1, john, john@schoolexample1.edu
2, gavin, gavin@schoolexample1.edu
3, gordon, gordon@schoolexample1.edu
4, hazel, hazel@schoolexample1.edu
Desired data output
Latest query version update
My thoughts on this:
SELECT
    t_ct.teacher_id, teacher.forename, t_ct.customer_id, customer.forename, ct,
    SUM(ct) OVER (PARTITION BY t_ct.teacher_id)
FROM
    (SELECT teacher_id, customer_id, count(*) AS ct
     FROM booking
     JOIN teacher ON booking.teacher_id = teacher.id
     GROUP BY teacher_id, customer_id) AS t_ct
JOIN teacher ON t_ct.teacher_id = teacher.id
JOIN customer ON t_ct.customer_id = customer.id
ORDER BY teacher.forename, ct DESC;
 teacher_id | forename | customer_id | forename | ct | sum
------------+----------+-------------+----------+----+-----
          4 | jennifer |           1 | oliver   |  1 |   1
          3 | max      |           1 | oliver   |  1 |   2
          3 | max      |           2 | chandler |  1 |   2
          2 | terry    |           2 | chandler |  3 |   5
          2 | terry    |           3 | sarah    |  1 |   5
          2 | terry    |           4 | vicky    |  1 |   5
Changing the ORDER BY to:
sum DESC, ct DESC, teacher.forename;
gives a better picture:
 teacher_id | forename | customer_id | forename | ct | sum
------------+----------+-------------+----------+----+-----
          2 | terry    |           2 | chandler |  3 |   5
          2 | terry    |           3 | sarah    |  1 |   5
          2 | terry    |           4 | vicky    |  1 |   5
          3 | max      |           1 | oliver   |  1 |   2
          3 | max      |           2 | chandler |  1 |   2
          4 | jennifer |           1 | oliver   |  1 |   1
The SUM(ct) OVER (PARTITION BY t_ct.teacher_id) portion is a window function that calculates the total number of bookings a teacher has, per the 'total_number_of_bookings' request from the dbfiddle. Basically it sums the bookings per teacher_id, and the value repeats on each row for a given teacher_id.
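For the re-booking rate itself, a more direct route (a sketch, assuming a repeat booking means a customer with more than one booking for the same teacher; per_customer and rebooking_rate_pct are names I made up) is to count repeat customers against all customers per teacher:

SELECT
    t.id AS teacher_id,
    t.name,
    -- customers who booked this teacher more than once, over all customers who booked them
    round(100.0 * count(*) FILTER (WHERE ct > 1) / count(*), 2) AS rebooking_rate_pct
FROM
    (SELECT teacher_id, customer_id, count(*) AS ct
     FROM booking
     GROUP BY teacher_id, customer_id) AS per_customer
JOIN teacher t ON t.id = per_customer.teacher_id
GROUP BY t.id, t.name
ORDER BY rebooking_rate_pct DESC;

On the sample dataset this gives teacher 1 (john) a 50% rate: customer 1 booked him twice, customer 2 once.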
UPDATE 12/10/2020: using the tablefunc module to do a crosstab on a temporary table. Not complete, as it does not deal with 4+ bookings.
CREATE TEMP TABLE booking_rate AS
SELECT
    t_ct.teacher_id, teacher.forename, ct,
    SUM(ct) OVER (PARTITION BY t_ct.teacher_id) AS total
FROM
    (SELECT teacher_id, ct, count(*)
     FROM (SELECT teacher_id, customer_id, count(*) AS ct
           FROM booking
           JOIN teacher ON booking.teacher_id = teacher.id
           GROUP BY teacher_id, customer_id) AS t_c_ct
     GROUP BY teacher_id, ct) AS t_ct
JOIN teacher ON t_ct.teacher_id = teacher.id
ORDER BY teacher.forename, ct DESC;
select * from booking_rate ;
 teacher_id | forename | ct | total
------------+----------+----+-------
          4 | jennifer |  1 |     1
          3 | max      |  1 |     1
          2 | terry    |  3 |     4
          2 | terry    |  1 |     4
SELECT t_id, name, total,
       round(coalesce(ct  / total::numeric, 0), 2) * 100 AS one,
       round(coalesce(ct2 / total::numeric, 0), 2) * 100 AS two,
       round(coalesce(ct3 / total::numeric, 0), 2) * 100 AS three,
       round(coalesce(ct4 / total::numeric, 0), 2) * 100 AS four
FROM
    (SELECT * FROM crosstab(
        'SELECT teacher_id, forename, total, ct, ct FROM booking_rate ORDER BY 1',
        'SELECT booking from generate_series(1,4) AS booking'
     ) AS (
        t_id  int,
        name  varchar,
        total int,
        ct    int,
        ct2   int,
        ct3   int,
        ct4   int
     )
    ) AS c_tab;
 t_id | name     | total | one    | two  | three | four
------+----------+-------+--------+------+-------+------
    2 | terry    |     4 |  25.00 | 0.00 | 75.00 | 0.00
    3 | max      |     1 | 100.00 | 0.00 |  0.00 | 0.00
    4 | jennifer |     1 | 100.00 | 0.00 |  0.00 | 0.00
(3 rows)
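Note that crosstab comes from the tablefunc extension, which must be installed once per database before the query above will run:

CREATE EXTENSION IF NOT EXISTS tablefunc;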

Select a MAX when using a window function that orders with nulls first in Postgres

I have a query that I'm attempting to optimize and running into some unexpected behavior.
For example, take a table like this:
CREATE TABLE objects
(id int, external_id int, timestamp timestamp);
INSERT INTO objects
(id, external_id, timestamp)
VALUES
(1, 1, '2019-08-16 12:00:00'),
(2, 1, NULL),
(3, 2, '2019-08-16 12:00:00'),
(4, 2, NULL);
I use a query like so to partition the objects by their external_id and then select the maximum timestamp across each partition:
SELECT *,
max(timestamp) OVER (PARTITION BY external_id) as max_timestamp,
row_number() OVER (PARTITION BY external_id ORDER BY timestamp ASC NULLS FIRST, id ASC)
FROM objects;
This produces the following (which is what I'm looking for):
[Results]:
| id | external_id | timestamp            | max_timestamp        | row_number |
|----|-------------|----------------------|----------------------|------------|
| 2  | 1           | (null)               | 2019-08-16T12:00:00Z | 1          |
| 1  | 1           | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2          |
| 4  | 2           | (null)               | 2019-08-16T12:00:00Z | 1          |
| 3  | 2           | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2          |
I want to remove the duplicated window definitions in favor of a single named window. For example:
SELECT *,
max(timestamp) OVER w as max_timestamp,
row_number() OVER w
FROM objects
WINDOW w AS (PARTITION BY external_id ORDER BY timestamp ASC NULLS FIRST, id ASC);
However, doing this produces a different result with the max_timestamp set to NULL for half the results:
[Results]:
| id | external_id | timestamp            | max_timestamp        | row_number |
|----|-------------|----------------------|----------------------|------------|
| 2  | 1           | (null)               | (null)               | 1          |
| 1  | 1           | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2          |
| 4  | 2           | (null)               | (null)               | 1          |
| 3  | 2           | 2019-08-16T12:00:00Z | 2019-08-16T12:00:00Z | 2          |
Why would introducing a sort order to the window function have any effect on the return of max?
http://sqlfiddle.com/#!17/1fcc4/4
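What is happening here: once a window definition has an ORDER BY, aggregates such as max() stop seeing the whole partition and instead use the default frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With NULLS FIRST, the first row's frame contains only the NULL timestamp, hence the NULL max_timestamp. A sketch of a fix that keeps the single named window but widens the frame for the aggregate (row_number() ignores frames, so it is unaffected):

SELECT *,
       max(timestamp) OVER (w ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING) AS max_timestamp,
       row_number() OVER w
FROM objects
WINDOW w AS (PARTITION BY external_id ORDER BY timestamp ASC NULLS FIRST, id ASC);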

Update using subquery sets same value for all records

I'm trying to calculate the weight of each record based on the value of a column (updated_at). When I run the following query:
UPDATE buyers
SET weight = RankedRecords.rank / (RankedRecords.count + 1.0)
FROM (
SELECT
id,
RANK() OVER (
PARTITION BY board_list_id ORDER BY 'updated_at' ASC
) AS rank,
COUNT(id) OVER (PARTITION BY board_list_id) AS count
FROM buyers
) RankedRecords
WHERE buyers.id = RankedRecords.id
All records with the same board_list_id get their weight updated to the same value, while I expect the weight values to differ, depending on rank.
Running just the subquery produces correct results (each record has a different rank), but the update doesn't work as expected.
What should I change?
You have a very subtle mistake in your query. Try this instead:
UPDATE
buyers
SET
weight = RankedRecords.rank / (RankedRecords.count + 1.0)
FROM
(
SELECT
id,
rank() OVER (PARTITION BY board_list_id ORDER BY updated_at ASC) AS rank,
count(id) OVER (PARTITION BY board_list_id) AS count
FROM buyers
) RankedRecords
WHERE
buyers.id = RankedRecords.id ;
Your little mistake: ORDER BY 'updated_at' is just ORDER BY 'constant-text'. If you want to refer to the column, either use "updated_at" (with double quotes) or updated_at (without them, which works because the name of your column is just lowercase ASCII characters).
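You can see the effect in isolation; ordering by a string literal gives every row the same sort key, so rank() returns 1 everywhere (a quick check against the same buyers table):

SELECT id,
       rank() OVER (PARTITION BY board_list_id ORDER BY 'updated_at') AS rank
FROM buyers ;
-- every row gets rank = 1, which is why all weights within a board_list_id collapsed to one value

With rank stuck at 1 and count constant per partition, weight = 1 / (count + 1.0) comes out identical for every record in the partition.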
Tried with:
CREATE TABLE buyers
(
id integer not null primary key,
board_list_id integer not null,
updated_at timestamp not null default now(),
weight double precision
) ;
INSERT INTO buyers (id, board_list_id, updated_at)
VALUES
(1, 1, '2017-01-09'),
(2, 1, '2017-01-10'),
(3, 1, '2017-01-11'),
(4, 1, '2017-01-12'),
(5, 2, '2017-01-09'),
(6, 2, '2017-01-10'),
(7, 2, '2017-01-11'),
(8, 1, '2017-01-12') ;
The result of the previous UPDATE (with a RETURNING * clause) would be:
+----+---------------+---------------------+--------+----+------+-------+
| id | board_list_id | updated_at          | weight | id | rank | count |
+----+---------------+---------------------+--------+----+------+-------+
|  1 |             1 | 2017-01-09 00:00:00 | 0.1667 |  1 |    1 |     5 |
|  2 |             1 | 2017-01-10 00:00:00 | 0.3333 |  2 |    2 |     5 |
|  3 |             1 | 2017-01-11 00:00:00 | 0.5    |  3 |    3 |     5 |
|  8 |             1 | 2017-01-12 00:00:00 | 0.6667 |  8 |    4 |     5 |
|  4 |             1 | 2017-01-12 00:00:00 | 0.6667 |  4 |    4 |     5 |
|  5 |             2 | 2017-01-09 00:00:00 | 0.25   |  5 |    1 |     3 |
|  6 |             2 | 2017-01-10 00:00:00 | 0.5    |  6 |    2 |     3 |
|  7 |             2 | 2017-01-11 00:00:00 | 0.75   |  7 |    3 |     3 |
+----+---------------+---------------------+--------+----+------+-------+

1st and 7th row in grouping

I have this table named Samples. The Date column values are just symbolic date values.
+----+------------+-------+------+
| Id | Product_Id | Price | Date |
+----+------------+-------+------+
|  1 |          1 |   100 |    1 |
|  2 |          2 |   100 |    2 |
|  3 |          3 |   100 |    3 |
|  4 |          1 |   100 |    4 |
|  5 |          2 |   100 |    5 |
|  6 |          3 |   100 |    6 |
...
+----+------------+-------+------+
I want to group by Product_Id such that I have the 1st sample in descending date order, plus a new column containing the Price of the 7th sample row in each product group. If the 7th row does not exist, the value should be NULL.
Example:
+----+------------+-------+------+----------+
| Id | Product_Id | Price | Date | 7thPrice |
+----+------------+-------+------+----------+
|  4 |          1 |   100 |    4 |      120 |
|  5 |          2 |   100 |    5 |      100 |
|  6 |          3 |   100 |    6 |     NULL |
+----+------------+-------+------+----------+
I believe I can achieve the table without the 7thPrice column with the following:
SELECT * FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY date DESC) r, *
    FROM Samples
) T WHERE T.r = 1
Any suggestions?
You can try something like this. I used your query to create a CTE, then joined rank 1 to rank 7.
;WITH sampleCTE AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY date DESC) r, *
    FROM Samples
)
SELECT *
FROM
    (SELECT * FROM sampleCTE WHERE r = 1) a
    LEFT JOIN
    (SELECT * FROM sampleCTE WHERE r = 7) b
        ON a.Product_Id = b.Product_Id
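On SQL Server 2012 or later, LEAD can fetch the 7th-newest price without the self-join; a sketch over the same Samples table:

SELECT Id, Product_Id, Price, [Date], [7thPrice]
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Product_Id ORDER BY date DESC) AS r,
        -- 6 rows further along the descending date order = the 7th-newest sample;
        -- LEAD returns NULL when that row does not exist
        LEAD(Price, 6) OVER (PARTITION BY Product_Id ORDER BY date DESC) AS [7thPrice]
    FROM Samples
) T
WHERE T.r = 1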

How to group rows, but only on non-disjoint date ranges, in SQL Server T-SQL

I am importing a subset of records and columns from a source table to a local table. I am trying to collapse the data so that I have unique rows in my table, but only when the date ranges are consecutive. I am having trouble because I can't figure out how to do the grouping without jumping over date ranges. Here is some sample data:
| PID | GroupID | Data | StartDate | EndDate
|  12 |       1 |    4 |        45 |      50
|  11 |       1 |    5 |        40 |      45
|  10 |       1 |    5 |        35 |      40
|   9 |       1 |    4 |        30 |      35
|   8 |       2 |    5 |        25 |      50
|   7 |       1 |    4 |        25 |      30
|   6 |       1 |    4 |        20 |      25
|   5 |       1 |    2 |        15 |      20
|   4 |       1 |    3 |        10 |      15
|   3 |       1 |    3 |         5 |      10
|   2 |       2 |    1 |         1 |      25
|   1 |       1 |    2 |         1 |       5
I am trying to get this result:
| GroupID | HistoryID | Data | StartDate | EndDate
|       1 |         1 |    4 |        45 |      50
|       1 |         2 |    5 |        35 |      45
|       1 |         3 |    4 |        20 |      35
|       1 |         4 |    2 |        15 |      20
|       1 |         5 |    3 |         5 |      15
|       1 |         6 |    2 |         1 |       5
|       2 |         1 |    5 |        25 |      50
|       2 |         2 |    1 |         1 |      25
So imagine there are thousands of group ids, the data column is actually multiple columns, and start/end date are actual dates.
What I was trying to do was some kind of solution by self-joining on StartDate and EndDate and comparing data, or doing some kind of partition by GroupID and grouping on Data, then taking the minimum StartDate and maximum EndDate. However, I can't figure out a way to do it such that Data 4 doesn't get collapsed from StartDate 20 to EndDate 50 and overlap the date range for Data 5.
I know SQL Server 2012 has new features for lookahead rows and running totals, but I'm implementing this in SQL Server 2008. Any ideas?
If gaps between ranges are not possible, or if they don't break a group, then:
------------
-- test data
------------
declare @data table
(
    Pid int,
    GroupID int,
    Data int,
    StartDate int,
    EndDate int
)
insert into @data (Pid, GroupID, Data, StartDate, EndDate)
values
    (10, 1, 4, 45, 50),
    (9,  1, 5, 40, 45),
    (8,  1, 5, 35, 40),
    (7,  1, 4, 30, 35),
    (6,  1, 4, 25, 30),
    (5,  1, 4, 20, 25),
    (4,  1, 2, 15, 20),
    (3,  1, 3, 10, 15),
    (2,  1, 3, 5, 10),
    (1,  1, 2, 1, 5)
-----------
-- solution
-----------
select
    GroupID, Data, StartDate = min(StartDate), EndDate = max(EndDate)
from
(
    select
        *,
        -- global position of the row when ordered by StartDate
        rn1 = row_number() over(order by StartDate),
        -- position within each (GroupID, Data), counted from the newest range down
        rn2 = row_number() over(partition by GroupID, Data order by StartDate desc)
    from @data
) t
-- rn1 + rn2 is constant across a consecutive run of the same GroupID and Data,
-- so it identifies each island to collapse
group by GroupID, Data, rn1 + rn2
order by StartDate desc
otherwise, please let me know.
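To see why grouping on rn1 + rn2 works, trace the Data = 4 rows of the test data above:

Pid  Data  StartDate  rn1  rn2  rn1+rn2
  5     4         20    5    4        9
  6     4         25    6    3        9
  7     4         30    7    2        9
 10     4         45   10    1       11

The consecutive run 20-35 (Pids 5, 6, 7) shares the key 9 and collapses into one range, while the detached range 45-50 (Pid 10) gets its own key 11 and stays separate.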