How to calculate average date occurrence frequency in SQL - tsql

I'm trying to produce a query on the following table (relevant portion only):
Create Table [Order] (
    OrderID int NOT NULL IDENTITY(1,1),
    CreationDate datetime NOT NULL,
    CustomerID int NOT NULL
)
I would like to see a list of CustomerIDs with each customer's average number of days between orders. I'm curious whether this can be done with a pure set-based solution or whether a cursor/temp table solution is necessary.

;WITH base AS
(
    SELECT CustomerID,
           CreationDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY CreationDate, OrderID) AS rn
    FROM [Order]
)
SELECT b1.CustomerID,
       AVG(DATEDIFF(DAY, b1.CreationDate, b2.CreationDate)) AS AvgDaysBetweenOrders
FROM base b1
JOIN base b2
    ON b1.CustomerID = b2.CustomerID
    AND b2.rn = b1.rn + 1
GROUP BY b1.CustomerID
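If you are on SQL Server 2012 or later (LAG() is not available before that), here is a sketch that avoids the self join entirely, assuming the same [Order] table:
;WITH diffs AS
(
    SELECT CustomerID,
           DATEDIFF(DAY,
                    LAG(CreationDate) OVER (PARTITION BY CustomerID
                                            ORDER BY CreationDate, OrderID),
                    CreationDate) AS days_between -- NULL for each customer's first order
    FROM [Order]
)
SELECT CustomerID,
       AVG(days_between) AS AvgDaysBetweenOrders -- AVG skips the NULL first rows
FROM diffs
GROUP BY CustomerID
Note that customers with only a single order come back with a NULL average here rather than being filtered out.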

Related

join vs subquery in searches between ranges

Subquery vs. join performance is an inexact science: searching turns up cases where each one wins, it depends on the data structure, and you have to test both to find out what holds for your data.
I have a subquery that I couldn't replace with a join in order to compare their performance.
Assume you have a price history table to which you add a record every time the price or its characteristics change. Take this simple example: sql fiddle simple sample!
create table price_hist
( hid serial,
product int,
start_day date,
price numeric,
max_discount numeric,
promo_code character(4) );
create table deliveries
( del_id serial,
del_date date,
product int,
quantity int,
u_price numeric);
insert into price_hist (product, start_day,price,max_discount,promo_code)
values
(21,'2018-03-14',56.22, .022, 'Sam2'),
(18,'2018-02-24',11.25, .031, 'pax3'),
(21,'2017-12-28',50.12, .019, 'titi'),
(21,'2017-12-01',51.89, .034, 'any7'),
(18,'2017-12-26',11.52, .039, 'jun3'),
(18,'2017-12-10',10.99, .029, 'sep9');
insert into deliveries(del_date, product, quantity)
values
('2017-12-05',21,4),
('2017-12-20',18,3),
('2017-12-28',21,2),
('2018-05-08',18,1),
('2018-08-20',21,5);
select d.del_id, d.del_date, d.product, d.quantity,
(select price from price_hist h where h.product=d.product order by h.start_day desc limit 1) u_price,
(select max_discount from price_hist h where h.product=d.product order by h.start_day desc limit 1) max_discount,
(select price from price_hist h where h.product=d.product order by h.start_day desc limit 1)*d.quantity total
from deliveries d;
The subqueries find values between date ranges; I have not been able to write a join in PostgreSQL that does the same.
You can use distinct on to get values from price_hist for the latest start_day:
select distinct on(product)
product, price, max_discount
from price_hist h
order by product, start_day desc
product | price | max_discount
---------+-------+--------------
18 | 11.25 | 0.031
21 | 56.22 | 0.022
(2 rows)
Use it as a derived table to join it with deliveries:
select
d.del_id, d.del_date, d.product, d.quantity,
h.price as u_price, h.max_discount, h.price * d.quantity as total
from deliveries d
join (
select distinct on(product)
product, price, max_discount
from price_hist
order by product, start_day desc
) h using(product)
SqlFiddle.
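If the goal is the price in effect on each delivery date rather than the latest price overall, a left join lateral (PostgreSQL 9.3+) can stand in for the correlated subqueries; the start_day <= del_date filter is my assumption about the intended semantics:
select d.del_id, d.del_date, d.product, d.quantity,
       h.price as u_price, h.max_discount, h.price * d.quantity as total
from deliveries d
left join lateral (
    select price, max_discount
    from price_hist
    where product = d.product
      and start_day <= d.del_date -- assumed: use the price row in force at delivery time
    order by start_day desc
    limit 1
) h on true;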

How can I SUM distinct records in a Postgres database where there are duplicate records?

Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id", the second is "id" (the order ID), and the third is "total" (the revenue).
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total) it includes the duplicate entries, even though the order ID is the same, which makes my numbers larger than if I SELECT DISTINCT id, total, export to Excel, and sum the values manually.
So my question is: how can I SUM over just the distinct order IDs so that I get the same revenue as if I had exported every distinct order ID row to Excel?
Thanks in advance!
Easy - just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
See live demo.
Also handles any level of duplication (triplicates etc.): a row duplicated n times contributes n times its total to the sum and n to the count, so the ratio is still the single-row total.
You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know why the same order and total appear more than once with different row_ids. So, using a common table expression (CTE) introduced with the with ... clause, we get the distinct id and total pairs.
Below the CTE, we use this distinct data to do the totaling: we join the original table to the aggregation over the distinct values, then collect the row_ids with array_agg so the information looks cleaner.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3
Create a custom aggregate:
CREATE OR REPLACE FUNCTION sum_func (
double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE 'sql';
CREATE AGGREGATE dist_sum (
pg_catalog."any",
double precision)
(
SFUNC = sum_func,
STYPE = float8
);
Then calculate the distinct sum like:
select dist_sum(distinct id, total)
from orders
SQLFiddle
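With duplicated rows like those in the other answers (112 and 40.37, each appearing twice), the DISTINCT pass hands each (id, total) pair to the transition function only once, so dist_sum returns 152.37.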
You can use DISTINCT inside your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Note this assumes duplicate rows for an id always carry the same total; two legitimately different rows with equal totals would also be collapsed. Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES
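For example, against the test table created in the earlier answer, each duplicated total is counted once per id:
select id, sum(distinct total)
from test
group by id
order by id;
  id  |  sum
------+--------
 1509 | 112.00
 3284 |  40.37
(2 rows)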
If we can trust that the total for one order really belongs on a single row, we can eliminate the duplicates in a sub-query by selecting the MAX of the PK id column. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
sql fiddle
In difficult cases (this variant collapses rows that share the same row_id, since equal jsonb object keys are merged):
select
id,
(
SELECT SUM(value::int4)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id
I would suggest just using a sub-query:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The above will give you the total for each id. (Without an ORDER BY, DISTINCT ON keeps an arbitrary row per id, which is fine here because the duplicate rows are identical.)
Use the query below if you want the grand total with every duplicate removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
select distinct id, total
from orders
) t
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;

Days since last purchase postgres (for each purchase)

Just have a standard orders table:
order_id
order_date
customer_id
order_total
Trying to write a query that generates a column that shows the days since the last purchase, for each customer. If the customer had no prior orders, the value would be zero.
I have tried something like this:
WITH user_data AS (
SELECT customer_id, order_total, order_date::DATE,
ROW_NUMBER() OVER (
PARTITION BY customer_id ORDER BY order_date::DATE DESC
)
AS order_count
FROM transactions
WHERE STATUS = 100 AND order_total > 0
)
SELECT * FROM user_data WHERE order_count < 3;
I could feed this into Tableau and then use some table calculations to wrangle the data, but I would really like to understand the SQL approach. My approach also only analyzes the two most recent transactions, which is a drawback.
Thanks
You should use the lag() function:
select *,
lag(order_date) over (partition by customer_id order by order_date)
as prior_order_date
from transactions
order by order_id
To get the number of days since the last order, just subtract the prior order date from the current order date:
select *,
order_date- lag(order_date) over (partition by customer_id order by order_date)
as days_since_last_order
from transactions
order by order_id
The query selects null if there is no prior order. You can use coalesce() to change it to zero.
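For example (assuming order_date is a date, so subtracting two dates yields an integer number of days):
select *,
       coalesce(order_date - lag(order_date)
                over (partition by customer_id order by order_date),
                0) as days_since_last_order
from transactions
order by order_id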
You indicated that you need to calculate the number of days since the last purchase.
..Trying to write a query that generates a column that shows the days
since the last purchase
So, basically, you need to get the difference between now and the last purchase date for each client. The query can be the following:
-- test DDL
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
order_date DATE,
customer_id INTEGER,
order_total INTEGER
);
INSERT INTO orders(order_date, customer_id, order_total) VALUES
('01-01-2015'::DATE,1,2),
('01-02-2015'::DATE,1,3),
('02-01-2015'::DATE,2,4),
('02-02-2015'::DATE,2,5),
('03-01-2015'::DATE,3,6),
('03-02-2015'::DATE,3,7);
WITH orderdata AS (
SELECT customer_id,order_total,order_date,
(now()::DATE - max(order_date) OVER (PARTITION BY customer_id)) as days_since_purchase
FROM orders
WHERE order_total > 0
)
SELECT DISTINCT customer_id, days_since_purchase FROM orderdata ORDER BY customer_id;

How to Calculate Gap Between two Dates in SQL Server 2005?

I have a data set as shown in the picture.
I am trying to get the date difference between eligenddate (First row) and eligstartdate (second row). I would really appreciate any suggestions.
Thank you
SQL2005:
One solution is to insert the rows, together with a row number (generated using [elig]startdate as the ORDER BY criterion; also see note #1), into a table variable (@DateWithRowNum, if the number of rows is small) or a temp table (#DateWithRowNum, if the number of rows is high) and then self join:
DECLARE @DateWithRowNum TABLE (
    memberid VARCHAR(50) NOT NULL,
    rownum INT NOT NULL,
    PRIMARY KEY(memberid, rownum),
    startdate DATETIME NOT NULL,
    enddate DATETIME NOT NULL
)

INSERT @DateWithRowNum (memberid, rownum, startdate, enddate)
SELECT memberid,
       ROW_NUMBER() OVER(PARTITION BY memberid ORDER BY startdate) AS rownum,
       startdate,
       enddate
FROM dbo.MyTable

SELECT crt.*, DATEDIFF(MONTH, prev.enddate, crt.startdate) AS gap
FROM @DateWithRowNum crt
LEFT JOIN @DateWithRowNum prev ON crt.memberid = prev.memberid AND crt.rownum - 1 = prev.rownum
ORDER BY crt.memberid, crt.rownum
Another solution is to use a common table expression instead of the table variable / temp table:
;WITH DateWithRowNum AS (
    SELECT memberid,
           ROW_NUMBER() OVER(PARTITION BY memberid ORDER BY startdate) AS rownum,
           startdate,
           enddate
    FROM dbo.MyTable
)
SELECT crt.*, DATEDIFF(MONTH, prev.enddate, crt.startdate) AS gap
FROM DateWithRowNum crt
LEFT /*HASH*/ JOIN DateWithRowNum prev ON crt.memberid = prev.memberid AND crt.rownum - 1 = prev.rownum
ORDER BY crt.memberid, crt.rownum
Note #1: I assume that you need to calculate these values for every memberid.
Note #2: The HASH hint (remove the comment markers around HASH to activate it) forces SQL Server to evaluate each data source of the LEFT JOIN (crt or prev) just once.
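For reference, on SQL Server 2012 or later (not the 2005 the question targets) LAG() removes the need for the self join; a sketch against the same dbo.MyTable:
SELECT memberid, startdate, enddate,
       DATEDIFF(MONTH,
                LAG(enddate) OVER (PARTITION BY memberid ORDER BY startdate),
                startdate) AS gap -- NULL for each member's first row
FROM dbo.MyTable
ORDER BY memberid, startdate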

most efficient query to get the first and last record id in a large dataset

I need to write a query against a large dataset to get the first and last record id, plus the first record's created time. A sample of the data is as follows:
In the above case, if the category "Blue" is passed into the query as parameter, I will expect to return "A12, 13:00, E66" as the result of the query.
I can use aggregate functions to get the max and min time from the dataset and join back to get the first and last record, but I'm just wondering whether there is a more efficient way to achieve the same output?
My advice would be to try to reduce the number of scan/seek operations by comparing execution plans and to place indexes on the categoryID (for lookup) and time (for sorting) columns.
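For example (a sketch; the index name is made up and <Table> is the question's placeholder), a composite index serves both the CategoryID lookup and the CreatedTime sort, and including RecordID makes it covering:
Create Index IX_Table_CategoryID_CreatedTime
    On <Table> (CategoryID, CreatedTime)
    Include (RecordID)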
If you have SQL Server 2008 or later, you could use the following, which requires two scans/seeks:
Declare @CategoryID As Varchar(16)
Set @CategoryID = 'Blue'

Select
    First_Record.RecordID,
    First_Record.CreatedTime,
    Last_Record.RecordID
From
    (
        Select Top 1
            RecordID,
            CreatedTime
        From
            <Table>
        Where
            CategoryID = @CategoryID
        Order By
            CreatedTime Asc
    ) First_Record
    Cross Apply
    (
        Select Top 1
            RecordID
        From
            <Table>
        Where
            CategoryID = @CategoryID
        Order By
            CreatedTime Desc
    ) Last_Record
If you have SQL Server 2012 or later, you could write the following, which requires only one scan/seek:
Select Top 1
    First_Value(RecordID) Over (Partition By CategoryID Order By CreatedTime Asc) As FirstRecordID,
    First_Value(CreatedTime) Over (Partition By CategoryID Order By CreatedTime Asc) As FirstCreatedTime,
    First_Value(RecordID) Over (Partition By CategoryID Order By CreatedTime Desc) As LastRecordID
From
    <Table>
Where
    CategoryID = @CategoryID