Find preceding row / entry under specific conditions in SQL / Redshift

I am trying to find the row preceding a specific occurrence in a database, or rather some data from it.
In this example I would like to find the movement_method of the row (ordered by timestamp) immediately preceding a user's visit to the pub. So in Tom's example I would like to learn that he went home by car before visiting the pub. (It doesn't matter how he traveled to the pub, only which method he used before going there.)
I have an example database with: user, location, movement_method, timestamp:
| user | location | movement_method | timestamp |
| ---- | -------- | --------------- | --------- |
| tom | work | car | 2022-03-02 14:30 |
| tom | home | car | 2022-03-02 20:30 |
| tom | pub | bus | 2022-03-02 22:30 |
| tom | home | foot | 2022-03-03 02:30 |
| jane | school | bus | 2022-03-02 08:30 |
| jane | home | bus | 2022-03-02 14:30 |
| jane | pub | foot | 2022-03-02 21:30 |
| jane | home | bus | 2022-03-02 23:30 |
| lila | work | bus | 2022-03-02 08:30 |
| lila | home | bus | 2022-03-02 16:30 |
| jake | friend | car | 2022-03-02 15:30 |
| jake | home | bus | 2022-03-02 20:30 |
| jake | pub | car | 2022-03-02 20:30 |
| jake | home | car | 2022-03-03 02:30 |
For this database the result I would want would be:
| user | preceding_movement_method |
| ---- | ------- |
| tom | car |
| jane | bus |
| jake | bus |
lila is not reported because she never visited the pub.
I only need the preceding movement_method before the pub visit (sorted by time); the movement_method used to get to the pub itself is not relevant.
My current approach is to use a window function to compute "preceding_movement_method", but I'm stuck on how to reach the entry preceding the row that matches the WHERE clause.
So I'm looking for something like this pseudocode:
select user,
(select preceding movement_method
from movement_database
where location = 'pub'
order by timestamp) as preceding_movement_method
from movement_database

The LAG() window function is where I would go with this. I set up the data (on sqlfiddle) as:
create table movements (
uname varchar(16),
location varchar(16),
movement_method varchar(16),
ts timestamp
);
insert into movements values
('tom', 'work', 'car', '2022-03-02 14:30'),
('tom', 'home', 'car', '2022-03-02 20:30'),
('tom', 'pub', 'bus', '2022-03-02 22:30'),
('tom', 'home', 'foot', '2022-03-03 02:30'),
('jane', 'school', 'bus', '2022-03-02 08:30'),
('jane', 'home', 'bus', '2022-03-02 14:30'),
('jane', 'pub', 'foot', '2022-03-02 21:30'),
('jane', 'home', 'bus', '2022-03-02 23:30'),
('lila', 'work', 'bus', '2022-03-02 08:30'),
('lila', 'home', 'bus', '2022-03-02 16:30'),
('jake', 'friend', 'car', '2022-03-02 15:30'),
('jake', 'home', 'bus', '2022-03-02 20:30'),
('jake', 'pub', 'car', '2022-03-02 20:30'),
('jake', 'home', 'car', '2022-03-03 02:30');
And the SQL as:
select uname, pmove
from (
  select uname, location,
         lag(movement_method) over (partition by uname order by ts) as pmove
  from movements
) as subq
where location = 'pub';
Note that two of Jake's timestamps are identical, so there is some uncertainty in the ordering there.
I'd also stay away from cross joins / loop joins: since you are on Redshift, the implication is very large datasets, and those operations can explode at that scale.
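To remove that ordering uncertainty, you can add a tie-breaker to the window's ORDER BY. A minimal sketch using SQLite (which also supports LAG); the chosen tie-breaker, pushing 'pub' rows after non-pub rows at the same timestamp, is an assumption about which row should win:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table movements (
    uname text, location text, movement_method text, ts text
);
insert into movements values
    ('jake', 'friend', 'car', '2022-03-02 15:30'),
    ('jake', 'home',   'bus', '2022-03-02 20:30'),
    ('jake', 'pub',    'car', '2022-03-02 20:30'),
    ('jake', 'home',   'car', '2022-03-03 02:30');
""")

# Order by ts, then sort 'pub' rows after non-pub rows on ties,
# so the tied 'home' row is still treated as the preceding one.
rows = conn.execute("""
select uname, pmove
from (
    select uname, location,
           lag(movement_method) over (
               partition by uname
               order by ts, case when location = 'pub' then 1 else 0 end
           ) as pmove
    from movements
) as subq
where location = 'pub';
""").fetchall()
print(rows)  # [('jake', 'bus')]
```

With the tie-breaker in place, Jake's preceding method is deterministically the bus ride home, no matter what order the rows were inserted in.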

Well, I'm not sure if it's a typo, but the user jake has an IDENTICAL timestamp for home and for the pub, which is an unlikely event. The code may look a bit complicated, but it does take that problem into consideration (identifiers are quoted because user and timestamp are reserved words):
select t1."user", movement_method
from movement t1
join (select m."user", max(m."timestamp") mx
      from movement m
      join (select "user", "timestamp" from movement where location = 'pub') t
        on m."user" = t."user"
      where m."timestamp" <= t."timestamp" and m.location != 'pub'
      group by m."user") t2
  on t1."user" = t2."user" and t1."timestamp" = mx and t1.location != 'pub';

Related

T-SQL datepart, counts and unions

I am having issues counting the number of events by date and hour recorded across multiple tables.
I have a system manufacturer's database with multiple 'events' tables that are all formatted identically (same number of columns and data types in the same order), each holding around 100,000 transaction events, that look like this:
| EventID | EventTimestamp | EventType | EventSubType | UnitGuid | DeviceGuid |
| ------- | -------------- | --------- | ------------ | -------- | ---------- |
| 1 | 2022-04-16 15:14:43.000 | 515 | 0 | AAAA | BBBB |
| 2 | 2022-04-16 15:14:44.000 | 520 | 0 | AAAA | CCCC |
| 3 | 2022-04-16 15:14:44.000 | 520 | 0 | AAAA | BBBB |
Because each table holds ~100,000 records, events that occur on a single day can be spread over one or more tables.
I am interested in obtaining a count of the total number of events per hour, per day which I am able to do on a table by table basis with the following query:
select DATEPART(DAY,EventTimestamp) as 'event date', DATEPART(HOUR,EventTimestamp) as 'event hour', count(*) as 'number of events'
from Events_70
group by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
order by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
which produces output like:
event date event hour number of events
16 15 3966
16 16 4530
16 17 4357
... ... ...
I've been able to consolidate the data for multiple days in Excel with a little manual work, but there's got to be a better way in SQL...
When I try to union two tables together with the following query:
select DATEPART(DAY,EventTimestamp) as 'event date', DATEPART(HOUR,EventTimestamp) as 'event hour', count(*) as 'number of events'
from Events_70
union all
select DATEPART(DAY,EventTimestamp) as 'event date', DATEPART(HOUR,EventTimestamp) as 'event hour', count(*) as 'number of events'
from Events_71
group by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
order by datepart(day,eventtimestamp), datepart(hour,eventtimestamp)
I am met with:
Column 'Events_70.EventTimestamp' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Msg 104, Level 16, State 1, Line 85
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.
Msg 104, Level 16, State 1, Line 85
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.
I'm also guessing that a sum of the counts for the same hour and day across the two tables might be required, but I've not got that far yet. One table's output looks like:
event date event hour number of events
19 21 2460
19 22 1963
**19 23 435**
And the next table's output looks like:
event date event hour number of events
**19 23 1057**
20 00 867
20 01 930
I've been searching around various forums this morning and haven't found a solution; any help would be appreciated.
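For what it's worth, both error messages go away if you UNION ALL the raw rows inside a derived table and aggregate once on the outside; that also sums counts for hours that span a table boundary. A sketch of the shape in SQLite, where strftime stands in for T-SQL's DATEPART and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Events_70 (EventTimestamp text);
create table Events_71 (EventTimestamp text);
insert into Events_70 values ('2022-04-19 23:10:00'), ('2022-04-19 23:20:00');
insert into Events_71 values ('2022-04-19 23:30:00'), ('2022-04-20 00:05:00');
""")

# Union the raw rows first, then group once in the outer query, so
# the hour that straddles the two tables (19th, 23:00) is summed.
rows = conn.execute("""
select strftime('%d', EventTimestamp) as event_date,
       strftime('%H', EventTimestamp) as event_hour,
       count(*) as number_of_events
from (
    select EventTimestamp from Events_70
    union all
    select EventTimestamp from Events_71
) as all_events
group by event_date, event_hour
order by event_date, event_hour;
""").fetchall()
print(rows)  # [('19', '23', 3), ('20', '00', 1)]
```

In T-SQL the same shape would be two plain `SELECT EventTimestamp FROM Events_7x` queries joined by UNION ALL inside a derived table, with the DATEPART grouping and ordering applied once in the outer SELECT.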

SQL Subquery for each

I have the following tables:
create table players
(
  name varchar(30) not null primary key
);
create table injuries
(
  bId int not null primary key,
  date DATE not null,
  name varchar(30),
  foreign key(name) references players
);
create table sportsBegins
(
  cId int not null primary key,
  date DATE,
  sportname varchar(20),
  name varchar(30),
  foreign key(name) references players
);
Here is some example data:
players
| name |
| ---- |
| John |
| Jane |
| George |
shows players in db
sportsBegins
| cId | date | sportname | name |
| --- | ---- | --------- | ---- |
| 1 | 2020-01-01 | Basketball | John |
| 2 | 2020-02-02 | Basketball | John |
| 3 | 2020-01-01 | Soccer | John |
| 4 | 2020-02-02 | Basketball | Jane |
| 5 | 2020-01-03 | Basketball | George |
| 6 | 2020-01-04 | Badminton | George |
shows what date players begin playing a sport
injuries
| bId | date | name |
| --- | ---- | ---- |
| 1 | 2020-01-01 | John |
| 2 | 2020-02-03 | Jane |
| 3 | 2020-01-05 | George |
shows the date these players reported injuries.
I want to count the number of DISTINCT players that have experienced an injury in Basketball AFTER the first day they got assigned the sport (not the same day).
So for each player, I need to grab only the first date they started playing basketball. Then, for that player, I need to compare his name AND date to the name AND date in the injuries table to see if he ever reported an injury after the date he got the sport assigned.
Example
In the example data I provided this would be the output
Total basketball injuries
2
Explanation of answer
John got assigned basketball twice; only the first date he got assigned basketball matters. Looking at the injuries table, he reported an injury only on that day, never after, so ignore him. Jane and George reported injuries after their first day of basketball, so count them.
This should get you the desired result
SELECT count(distinct injuries.name)
FROM injuries
INNER JOIN (
  SELECT name, min(date) as startDate
  FROM sportsBegins
  WHERE sportname = 'Basketball'
  GROUP BY name
) as startDates
  ON injuries.name = startDates.name
 AND injuries.date > startDates.startDate
Quick explanation:
- startDates extracts the first date each player started playing basketball
- the join condition filters only injuries which happened after the first start date for each player
- count(distinct injuries.name) ensures each player only gets counted once, even if he/she reported more than one injury after the first start date
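As a quick check, running that query against the sample data in SQLite returns the expected 2 (column types are simplified to text and int here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table injuries (bId int, date text, name text);
create table sportsBegins (cId int, date text, sportname text, name text);
insert into sportsBegins values
    (1, '2020-01-01', 'Basketball', 'John'),
    (2, '2020-02-02', 'Basketball', 'John'),
    (3, '2020-01-01', 'Soccer',     'John'),
    (4, '2020-02-02', 'Basketball', 'Jane'),
    (5, '2020-01-03', 'Basketball', 'George'),
    (6, '2020-01-04', 'Badminton',  'George');
insert into injuries values
    (1, '2020-01-01', 'John'),
    (2, '2020-02-03', 'Jane'),
    (3, '2020-01-05', 'George');
""")

# John's injury is on his first basketball day (not strictly after),
# so only Jane and George are counted.
total = conn.execute("""
select count(distinct injuries.name)
from injuries
inner join (
    select name, min(date) as startDate
    from sportsBegins
    where sportname = 'Basketball'
    group by name
) as startDates
  on injuries.name = startDates.name
 and injuries.date > startDates.startDate;
""").fetchone()[0]
print(total)  # 2
```

ISO-formatted date strings compare correctly even as text, which is why the `>` comparison works here without a date type.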

PostgreSQL: calculation between first and last

I'm trying to write a query that calculates the number of days between the first and last score per id.
The data sample:
| id | date | score |
| -- | ---- | ----- |
| 11 | 1/1/2017 | 25.34 |
| 4 | 1/2/2017 | 34.34 |
| 25 | 1/2/2017 | 15.78 |
| 4 | 3/2/2017 | 47.2 |
| 25 | 7/3/2017 | 65.21 |
| 11 | 9/3/2017 | 96.09 |
| 25 | 10/3/2017 | 11.3 |
| 4 | 10/3/2017 | 27.12 |
Which is far from what I need, but I'm really lost. Clueless to be honest. Any idea?
Thanks
Try this (column names adapted to your sample: id, date, score; swap in your own table name):
SELECT
  id,
  last_score - first_score AS days_between_last_and_first_score,
  total_score::float / NULLIF(last_score - first_score, 0) AS score_per_day
FROM
(
  SELECT id,
         MAX(date) AS last_score,
         MIN(date) AS first_score,
         SUM(score) AS total_score
  FROM scores
  GROUP BY id
) AS sub_query
Subtracting one date from another in PostgreSQL yields an integer number of days; the NULLIF guards against dividing by zero when the first and last score fall on the same day.
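A quick way to sanity-check the max/min shape is SQLite, where julianday() stands in for Postgres's native date subtraction (the sample dates are rewritten as ISO, assuming day/month order in the original, and the table name is a placeholder):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table scores (id int, date text, score real);
insert into scores values
    (11, '2017-01-01', 25.34),
    (4,  '2017-02-01', 34.34),
    (25, '2017-02-01', 15.78),
    (4,  '2017-02-03', 47.2),
    (25, '2017-03-07', 65.21),
    (11, '2017-03-09', 96.09),
    (25, '2017-03-10', 11.3),
    (4,  '2017-03-10', 27.12);
""")

# Days between first and last score per id; in Postgres this is
# simply max(date) - min(date) when the column has type date.
rows = conn.execute("""
select id,
       cast(julianday(max(date)) - julianday(min(date)) as integer) as days_between
from scores
group by id
order by id;
""").fetchall()
print(rows)  # [(4, 37), (11, 67), (25, 37)]
```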

What is the datatype of Months(only months) in postgres?

I need to create a table in PostgreSQL with a column named "month". The column should hold January, February, etc., not 1, 2, 3. And I need to retrieve data ordered by month. What datatype should I use, and how can I retrieve data ordered by month?
If you only need to save months, and not entire dates, I'd create an enum:
CREATE TYPE month_enum AS ENUM
('January',
'February',
'March',
'April',
'May',
'June',
'July',
'August',
'September',
'October',
'November',
'December'
);
It is best to save the month as an integer and show the month name at query time:
with months(month) as (
select generate_series(1, 12)
)
select
month as month_number,
to_char(
'1999-12-31'::date + month * interval '1 month',
'Month'
) as month_name
from months
order by month_number; -- or by month_name
month_number | month_name
--------------+------------
1 | January
2 | February
3 | March
4 | April
5 | May
6 | June
7 | July
8 | August
9 | September
10 | October
11 | November
12 | December
To make it easy to build the query create a function returning the month name:
create or replace function month_name(month integer)
returns text as $$
select
to_char(
'1999-12-31'::date + month * interval '1 month',
'Month'
);
$$ language sql;
Now it is simply:
with months(month) as (
select generate_series(1, 12)
)
select
month as month_number,
month_name(month)
from months
order by month_number;
From what you've asked, you have a few options depending on what you need:
If you only need Months as a static record but you don't actually need the time, you can use enum, as Mureinik answered,
If you need the Month as part of a specific time, you could use datetime.
Assuming you went with ENUM you can just use SELECT * FROM "Month" ORDER BY id ASC.
a_horse_with_no_name has a point in saying that it might be best to use numerical values for months due to localization issues. You could make Month its own table holding month names for different languages, but there is probably a more effective way to do it. Alternatively, you could store the number for each month and, when querying, resolve the month name from the number in your application, as suggested. That way you can show a different name depending on the locale.
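The localization point can be sketched quickly: store the month as an integer so it sorts chronologically, then resolve the display name afterwards. Here Python's calendar module plays the role of Postgres's to_char, and the sales table is made up:

```python
import sqlite3
import calendar

conn = sqlite3.connect(":memory:")
conn.execute("create table sales (month int, amount real)")
conn.executemany("insert into sales values (?, ?)",
                 [(4, 10.0), (1, 20.0), (12, 30.0)])

# Integers sort chronologically; names are attached only for display.
rows = conn.execute("select month, amount from sales order by month").fetchall()
named = [(calendar.month_name[m], amt) for m, amt in rows]
print(named)  # [('January', 20.0), ('April', 10.0), ('December', 30.0)]
```

Had the months been stored as text, ORDER BY would have put April before December before January, which is exactly the problem integer storage avoids.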

How to find the average of certain records T-SQL

I have a table variable that I am dumping data into:
DECLARE @TmpTbl_SKUs AS TABLE
(
Vendor VARCHAR (255),
Number VARCHAR(4),
SKU VARCHAR(20),
PurchaseOrderDate DATETIME,
LastReceivedDate DATETIME,
DaysDifference INT
)
Some records don't have a purchase order date or last received date, so the days difference is null as well. I have done a lot of inner joins of the table on itself, but it either takes too long or comes out incorrect most of the time.
Is it possible to get the average days difference per SKU? How would I check whether there is only 1 record for that SKU? If there is only 1 record, then I have to fall back to the average at the vendor level.
Here is the structure:
Vendor has many Numbers and Numbers has many SKUs
Any help would be great, I can't seem to crack this one, nor can I find anything related to this online. Thanks in advance.
Here is some sample data:
| Vendor | Number | SKU | PurchaseOrderDate | LastReceivedDate | DaysDifference |
| ------ | ------ | --- | ----------------- | ---------------- | -------------- |
| OTHER PMDD | 1111 | OP1111 | 2009-08-21 00:00:00.000 | 2009-09-02 00:00:00.000 | 12 |
| OTHER PMDD | 1111 | OP1112 | 2009-12-09 00:00:00.000 | 2009-12-17 00:00:00.000 | 8 |
| MANTOR | 3333 | MA1111 | 2006-02-15 00:00:00.000 | 2006-02-23 00:00:00.000 | 8 |
| MANTOR | 3333 | MA1112 | 2006-02-15 00:00:00.000 | 2006-02-23 00:00:00.000 | 8 |
I'm sorry, I may have written this wrong. If there is only 1 record for a SKU, I want to return its DaysDifference (if it's not null). If the SKU has more than 1 record and they are not null, return the average days difference. If they are all null, fall back to the vendor-level average of the SKUs that are not null; otherwise just return 7. This is what I have tried:
SELECT t1.SKU, ISNULL
(
  AVG(t1.DaysDifference),
  (
    SELECT ISNULL(AVG(t2.DaysDifference), 7)
    FROM @TmpTbl_SKUs t2
    WHERE t2.SKU = t1.SKU
    GROUP BY t2.Vendor, t2.Number, t2.SKU
  )
)
FROM @TmpTbl_SKUs t1
GROUP BY t1.SKU
I keep playing with this and have something close, but I just don't understand how to check whether a SKU has multiple records, or how to fall back to the vendor level.
Try this:
EDITED: added NULLIF(..., 0) to treat 0s as NULLs.
SELECT
t1.SKU,
COALESCE(
NULLIF(AVG(t1.DaysDifference), 0),
NULLIF(t2.AvgDifferenceVendor, 0),
7
) AS AvgDiff
FROM @TmpTbl_SKUs t1
INNER JOIN (
SELECT Vendor, AVG(DaysDifference) AS AvgDifferenceVendor
FROM @TmpTbl_SKUs
GROUP BY Vendor
) t2 ON t1.Vendor = t2.Vendor
GROUP BY t1.SKU, t2.AvgDifferenceVendor
EDIT 2: how I tested the script.
For testing I'm using the sample data posted with the question.
DECLARE @TmpTbl_SKUs AS TABLE
(
Vendor VARCHAR (255),
Number VARCHAR(4),
SKU VARCHAR(20),
PurchaseOrderDate DATETIME,
LastReceivedDate DATETIME,
DaysDifference INT
)
INSERT INTO @TmpTbl_SKUs
(Vendor, Number, SKU, PurchaseOrderDate, LastReceivedDate, DaysDifference)
SELECT 'OTHER PMDD', '1111', 'OP1111', '2009-08-21 00:00:00.000', '2009-09-02 00:00:00.000', 12
UNION ALL
SELECT 'OTHER PMDD', '1111', 'OP1112', '2009-12-09 00:00:00.000', '2009-12-17 00:00:00.000', 8
UNION ALL
SELECT 'MANTOR', '3333', 'MA1111', '2006-02-15 00:00:00.000', '2006-02-23 00:00:00.000', 8
UNION ALL
SELECT 'MANTOR', '3333', 'MA1112', '2006-02-15 00:00:00.000', '2006-02-23 00:00:00.000', 8;
First I'm running the script on the unmodified data. Here's the result:
SKU AvgDiff
-------------------- -----------
MA1111 8
MA1112 8
OP1111 12
OP1112 8
AvgDiff for every SKU is identical to the original DaysDifference for every SKU, because there's only one row per each one.
Now I'm changing DaysDifference for SKU='MA1111' to 0 and running the script again. The result is:
SKU AvgDiff
-------------------- -----------
MA1111 4
MA1112 8
OP1111 12
OP1112 8
Now AvgDiff for MA1111 is 4. Why? Because the average for the SKU is 0, and so the average by Vendor is taken, which has been calculated as (0 + 8) / 2 = 4.
Next step is to set DaysDifference to 0 for all the SKUs of the same Vendor. In this case I'm setting it for SKUs MA1111 and MA1112. Here's the result of the script for this change:
SKU AvgDiff
-------------------- -----------
MA1111 7
MA1112 7
OP1111 12
OP1112 8
So now AvgDiff is 7 for both MA1111 and MA1112. How has it become so? Both have DaysDifference = 0. That means that the average by Vendor should be taken for each one. But Vendor average is 0 too in this case. According to the requirement, the average here should default to 7, which is what the script has returned.
So the script seems to be working correctly. I understand that it's either me having missed something or you having forgotten to mention some details. In any case, I would be glad to see where this script fails to solve your problem.