Redshift PostgreSQL Distinct ON Operator

Redshift PostgreSQL Distinct ON Operator - postgresql

I have a data set that I want to parse for to see multi-touch attribution. The data set is made up by leads who responded to a marketing campaign and their marketing source.
Each lead can respond to multiple campaigns and I want to get their first marketing source and their last marketing source in the same table.
I was thinking I could create two tables and use a select statement from both.
The first table would attempt to create a table with the most recent marketing source from every person (using email as their unique ID).
create table temp.multitouch1 as (
select distinct on (email) email, date, market_source as last_source
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE DESC);
Then I would create a table with deduped emails but this time for the first source.
create table temp.multitouch2 as (
select distinct on (email) email, date, market_source as first_source
from sf.campaignmember
where date >= '1/1/2016' ORDER BY DATE ASC);
Finally I wanted to simply select the email and join the first and last market sources to it each in their own column.
select a.email, a.last_source, b.first_source, a.date
from temp.multitouch1 a
left join temp.multitouch b on b.email = a.email
Since distinct on doesn't work on redshift's postgresql version I was hoping someone had an idea to solve this issue in another way.
EDIT 2/22: For more context I'm dealing with people and campaigns they've responded to. Each record is a "campaign response" and every person can have more than one campaign response with multiple sources. I'm trying make a select statement which would dedupe by person and then have columns for the first campaign/marketing source they've responded to and the last campaign/marketing source they've responded to respectively.
EDIT 2/24: Ideal output is a table with 4 columns: email, last_source, first_source, date.
The first and last source columns would be the same for people with only 1 campaign member record and different for everyone who has more than 1 campaign member record.

I believe you could use row_number() inside case expressions like this:
SELECT
email
, MIN(first_source) AS first_source
, MIN(date) first_date
, MAX(last_source) AS last_source
, MAX(date) AS last_date
FROM (
SELECT
email
, date
, CASE
WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
ELSE NULL
END AS first_source
, CASE
WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
ELSE NULL
END AS last_source
FROM sf.campaignmember
WHERE date >= '2016-01-01'
) s
WHERE first_source IS NOT NULL
OR last_source IS NOT NULL
GROUP BY
email
tested here: SQL Fiddle
PostgreSQL 9.3 Schema Setup:
CREATE TABLE campaignmember
(email varchar(3), date timestamp, market_source varchar(1))
;
INSERT INTO campaignmember
(email, date, market_source)
VALUES
('a#a', '2016-01-02 00:00:00', 'x'),
('a#a', '2016-01-03 00:00:00', 'y'),
('a#a', '2016-01-04 00:00:00', 'z'),
('b#b', '2016-01-02 00:00:00', 'x')
;
Query 1:
SELECT
email
, MIN(first_source) AS first_source
, MIN(date) first_date
, MAX(last_source) AS last_source
, MAX(date) AS last_date
FROM (
SELECT
email
, date
, CASE
WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
ELSE NULL
END AS first_source
, CASE
WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
ELSE NULL
END AS last_source
FROM campaignmember
WHERE date >= '2016-01-01'
) s
WHERE first_source IS NOT NULL
OR last_source IS NOT NULL
GROUP BY
email
Results:
| email | first_source | first_date | last_source | last_date |
|-------|--------------|---------------------------|-------------|---------------------------|
| a#a | x | January, 02 2016 00:00:00 | z | January, 04 2016 00:00:00 |
| b#b | x | January, 02 2016 00:00:00 | x | January, 02 2016 00:00:00 |
& a small extension to the request, count the number of contact points.
SELECT
email
, MIN(first_source) AS first_source
, MIN(date) first_date
, MAX(last_source) AS last_source
, MAX(date) AS last_date
, MAX(numof) AS Numberof_Contacts
FROM (
SELECT
email
, date
, CASE
WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date ASC) = 1 THEN market_source
ELSE NULL
END AS first_source
, CASE
WHEN ROW_NUMBER() OVER (PARTITION BY email ORDER BY date DESC) = 1 THEN market_source
ELSE NULL
END AS last_source
, COUNT(*) OVER (PARTITION BY email) as numof
FROM campaignmember
WHERE date >= '2016-01-01'
) s
WHERE first_source IS NOT NULL
OR last_source IS NOT NULL
GROUP BY
email

You can use the good old left join groupwise maximum.
SELECT DISTINCT c1.email, c1.date, c1.market_source
FROM sf.campaignmember c1
LEFT JOIN sf.campaignmember c2
ON c1.email = c2.email AND c1.date > c2.date AND c1.id > c2.id
LEFT JOIN sf.campaignmember c3
ON c1.email = c3.email AND c1.date < c3.date AND c1.id > c3.id
WHERE c1.date >= '1/1/2016' AND c2.date >= '1/1/2016'
AND (c2.email IS NULL OR c3.email IS NULL)
This assumes you have an unique id column, if (date, email) is unique id is not needed.

Related

PostgreSQL - SQL function to loop through all months of the year and pull 10 random records from each

I am attempting to pull 10 random records from each month of this year using this query here but I get an error "ERROR: relation "c1" does not exist
"
Not sure where I'm going wrong - I think it may be I'm using Mysql syntax instead, but how do I resolve this?
My desired output is like this
Month
Another header
2021-01
random email 1
2021-01
random email 2
total of ten random emails from January, then ten more for each month this year (til November of course as Dec yet to happen)..
With CTE AS
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
DISTINCT(TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM')) AS month
,CASE
WHEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN
JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
)
SELECT *
FROM CTE C1
LEFT JOIN
(SELECT RN
,month
,email
FROM CTE C2
WHERE C2.month = C1.month
ORDER BY RANDOM() LIMIT 10) C3
ON C1.RN = C3.RN
ORDER By month ASC```

You can't reference an outer table inside a derived table with a regular join. You need to use left join lateral to make that work

I did end up finding a more elegant solution to my query here via this source from github :
SELECT
month
,email
FROM
(
Select month,
email,
Row_Number() Over (Partition By month Order By FLOOR(RANDOM()*(1-1000000+1))) AS RN
From (
SELECT
TO_CHAR(DATE_TRUNC('month', timestamp ), 'YYYY-MM') AS month
,CASE
WHEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'name') = 'email'
THEN JSON_EXTRACT_PATH_TEXT(json_extract_array_element_text (form_data,0),'value')
END AS email
FROM form_submits_y2 fs
WHERE fs.website_id IN (791)
AND month LIKE '2021%'
GROUP BY 1,2
ORDER BY 1 ASC
)
) q
WHERE
RN <=10
ORDER BY month ASC

How to get number of consecutive days from current date using postgres?

I want to get the number of consecutive days from the current date using Postgres SQL.
enter image description here
Above is the scenario in which I have highlighted consecutive days count should be like this.
Below is the SQL query which I have created but it's not returning the expected result
with grouped_dates as (
select user_id, created_at::timestamp::date,
(created_at::timestamp::date - (row_number() over (partition by user_id order by created_at::timestamp::date) || ' days')::interval)::date as grouping_date
from watch_history
)
select * , dense_rank() over (partition by grouping_date order by created_at::timestamp::date) as in_streak
from grouped_dates where user_id = 702
order by created_at::timestamp::date
Can anyone please help me to resolve this issue?
If anyhow we can able to apply distinct for created_at field to below query then I will get solutions for my issue.
WITH list AS
(
SELECT user_id,
(created_at::timestamp::date - (row_number() over (partition by user_id order by created_at::timestamp::date) || ' days')::interval)::date as next_day
FROM watch_history
)
SELECT user_id, count(*) AS number_of_consecutive_days
FROM list
WHERE next_day IS NOT NULL
GROUP BY user_id
Does anyone have an idea how to apply distinct to created_at for the above mentioned query ?

To get the "number of consecutive days" for the same user_id :
WITH list AS
(
SELECT user_id
, array_agg(created_at) OVER (PARTITION BY user_id ORDER BY created_at RANGE BETWEEN CURRENT ROW AND '1 day' FOLLOWING) AS consecutive_days
FROM watch_history
)
SELECT user_id, count(DISTINCT d.day) AS number_of_consecutive_days
FROM list
CROSS JOIN LATERAL unnest(consecutive_days) AS d(day)
WHERE array_length(consecutive_days, 1) > 1
GROUP BY user_id
To get the list of "consecutive days" for the same user_id :
WITH list AS
(
SELECT user_id
, array_agg(created_at) OVER (PARTITION BY user_id ORDER BY created_at RANGE BETWEEN CURRENT ROW AND '1 day' FOLLOWING) AS consecutive_days
FROM watch_history
)
SELECT user_id
, array_agg(DISTINCT d.day ORDER BY d.day) AS list_of_consecutive_days
FROM list
CROSS JOIN LATERAL unnest(consecutive_days) AS d(day)
WHERE array_length(consecutive_days, 1) > 1
GROUP BY user_id
full example & result in dbfiddle

How to find the first and last date prior to a particular date in Postgresql?

I am a SQL beginner. I have trouble on finding the answer of this question
For each customer_id who made an order on January 1, 2006, what was their historical (prior to January 1, 2006) first and last order dates?
I've tried to solve it using a subquery. But I don't know how to find the first and last order dates prior to Jan 1.
Columns of table A:
customer_id
order_id
order_date
revenue
product_id
Columns of table B:
product_id
category_id
SELECT customer_id, order_date FROM A
(
SELECT customer_id FROM A
WHERE order_date = ‘2006-01-01’
)
WHERE ...

There are two subqueries actually. First for "For each customer_id who made an order on January 1, 2006" and second for "their historical (prior to January 1, 2006) first and last order dates"
So, first:
select customer_id from A where order_date = '2006-01-01';
and second:
select customer_id, min(order_date) as first_date, max(order_date) as last_date
from A
where order_date < '2006-01-01' group by customer_id;
Finally you need to get only those customers from second subquery who exists in the first one:
select customer_id, min(order_date) as first_date, max(order_date) as last_date
from A as t1
where
order_date < '2006-01-01' and
customer_id in (
select customer_id from A where order_date = '2006-01-01')
group by customer_id;
or, could be more efficient:
select customer_id, min(order_date) as first_date, max(order_date) as last_date
from A as t1
where
order_date < '2006-01-01' and
exists (
select 1 from A as t2
where t1.customer_id = t2.customer_id and t2.order_date = '2006-01-01')
group by customer_id;

You can use conditionals in aggregate functions:
SELECT customer_id, MIN(order_date) AS first, MAX(order_date) AS last FROM A
WHERE customer_id IN (SELECT customer_id FROM A WHERE order_date = ‘2006-01-01’) AND order_date < '2006-01-01'
GROUP BY customer_id;

how can i calculate sum of PaidAmount that is made between two dates

I have 3 tables which are Accounts, Payments, Statements. Table Accounts have all the accounts, table Payments have all the payments made to the account, and table Statements have all the statement data for the accounts.
Accounts
AccountID | DateOfDeath |
1001 | 2014-03-10 |
Payments
AccountID | PaidAmount | PaymentDate
1001 | 80.27 | 2014-07-09
1001 | 80.27 | 2014-06-10
1001 | 80.27 | 2014-05-12
1001 | 80.27 | 2014-04-13
1001 | 80.27 | 2014-03-15
1001 | 80.27 | 2014-02-14
Statements
AccountID | Balance | StatementDate
1001 | 0.00 | 2014-03-28
1001 | 1909.31 | 2014-02-25
I need to know the sum of PaidAmount (table Payments) in Payments table which is between the StatementDate (table Statements) of 2014-03-28 and 2014-02-25. The sum of the PaidAmount should have been 80.27 but I am getting 321.08. Can anyone tell me what I am doing wrong or how can I write the query in a better way?
here is what I have so far
create table #temp1
(
AccountID Numeric(9, 0)
, DateOfDeath date
, StatementDate date
, Balance numeric(17,2)
)
insert into #temp1
(
AccountID, DateOfDeath, StatementDate, Balance
)
select a.AccountID
,DateofDeath
,StatementDate
,Balance
from Accounts a
inner join Statements b on a.accountID = b.accountID
where StatementDate in (select top 1 statementdate
from Statements
where AccountID = a.AccountID
and StatementDate >= DateOfDeath
order by StatementDate)
Order By a.AccountID, StatementDate
create table #temp2
(
AccountId Numeric(9,0)
, PaidAmount Numeric(10, 2)
, PaymentDate date
)
select a.accountid, sum(a.Paidamount), max(a.PaymentDate)
from tblCreditDefenseInceptionToDateBenefit a
inner join #temp1 b on a.accountid = b.accountid
where a.paymentdate <= (select top 1 StatementDate from Statements
where AccountID = a.accountid
and statementdate >= b.dateofdeath
order by StatementDate desc)
and a.paymentdate > (select top 1 StatementDate from Statements
where AccountID = a.accountid
and statementdate < b.dateofdeath
order by StatementDate desc)
group by a.accountid
order by a.accountid desc
select * from #temp2
drop table #temp1
drop table #temp2

you can go about it a few ways
Create table #accounts
(AccountID int, Date_Death date)
insert into #accounts
(accountID, Date_death)
values
('1001', '03/10/2014')
Create Table #payments
(AccountID int, paidamt decimal(6,2), paymentdt date)
insert into #payments
(AccountID , paidamt, paymentdt)
values
('1001', '80.27','07/09/2014'),
('1001', '80.27','06/10/2014'),
('1001', '80.27','05/12/2014'),
('1001', '80.27','04/13/2014'),
('1001', '80.27','03/15/2014'),
('1001', '80.27','02/14/2014')
;
with cte as (
select
Accountid,
case when paymentdt between '02/25/2014'and '03/28/2014' then (paidamt) else null end as paidamt
from
#payments
)
Select
accountid,
SUM(paidamt)
from cte
group by
AccountID
or
put it in the where clause instead of doing a case statement, really depends onyour style
select
accountid,
sum(paidamt)paidamt
from
#payments
where paymentdate >= '02/25/2014'
and paymentdate <= '03/282014'
or
if you want to use the statement table dates as parameters
with cte as
(
select
a.AccountID,
case when a.paymentdt between b.min_dt and b.max_dt then a.paidamt else null end as 'pdamt'
from
#payments as a
inner join
(select accountid, MIN(statementdt)min_dt, MAX(statementdt)max_dt from #statement group by accountid) as b on b.accountid = a.AccountID
)
select
AccountID,
SUM(pdamt) as 'Paid Amount'
from
cte
group by
AccountID
again, could be added in where clase if you dontwant to do case staements

Grouping SQL results by continous time intervals (oracle sql)

I have following data in the table as below and I am looking for a way to group the continuous time intervals for each id to return:
CREATE TABLE DUMMY
(
ID VARCHAR2(10 BYTE),
TIME_STAMP VARCHAR2(8 BYTE),
NAME VARCHAR2(255 BYTE)
);
SELECT ID, min(TIME_STAMP) "startDate", max(TIME_STAMP) "endDate", NAME
GROUP BY ID , NAME
something like
100 20011128 20011203 David
100 20011204 20011207 Unknown
100 20011208 20011215 David
100 20011216 20011220 Sara
and so on ...
ps. I have a sample script, but i don't know how to attach my file.
Hi every one here is more input:
There is only one record with time_stamp for a specific ID.
Users can be different, for example for day 1 David, day 2 unknown, day 3 David and so on.
So there is one row for every day of year for each ID but with different users.
Now, i want to see the break point, differences base on time_stamp intervals from day one
until last day for a specific ID in day order from begin day until last day.
Query Result should be :
ID NAME MIN_DATE MAX_DATE
100 David 20011128 20050407
100 Sara 20050408 20050417
100 David 20050418 20080416
100 Unknown 20080417 20080507
100 David 20080508 20080508
100 Unknown 20080509 20080607
100 David 20080608 20080608
100 Unknown 20080609 20080921
100 David 20080922 20080922
100 Unknown 20080923 20081231
100 David 20090101 20090405
thanks
Hi again, many thanks to everyone, i have solved the problem, here is the solution:
select id, min(time_stamp), max(time_stamp), name
from ( select id, time_stamp, name,
max(rn) over (order by time_stamp) grp
from ( select id, time_stamp, name,
case
when lag(name) over (order by time_stamp) <> name or
row_number() over (order by time_stamp) = 1
then row_number() over (order by time_stamp)
end rn
from dummy
)
)
group by id, grp, name
order by 1

Select
ID,
Name,
min(time_stamp) min_date,
max(time_stamp) max_date
from
Dummy
group by
Id,
Name
That should work.
IF you want the date range for each Id, but all the names you can do:
Select
d.Id,
d.Name,
dr.min_date,
dr.max_date
from
Dummy d
JOIN
(Select
Id,
min(time_stamp) min_date,
max(time_stamp) max_date
from
Dummy
group by
Id
) dr
on ( dr.Id = d.Id)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse