How to accumulate values in T-SQL

I have to solve a problem and don't know how to do it. I'm using SQL Server 2012.
I have data in this schema:
--------------------------------------------------------------------------------------------------
DriverId | BeginDate | EndDate  | NextBeginDate | Rest in Hours | Drive Time in Minutes | Drive KM
integer  | datetime  | datetime | datetime      | integer       | integer               | decimal(10,3)
--------------------------------------------------------------------------------------------------
Rest in Hours = NextBeginDate - EndDate
Drive Time in Minutes = EndDate - BeginDate
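For reference, a minimal T-SQL sketch of those two derived columns (the Drive table name is borrowed from the answer below; the column names come from the schema above):
select DriverId,
       DATEDIFF(hour, EndDate, NextBeginDate) as RestInHours,       -- rest between two trips
       DATEDIFF(minute, BeginDate, EndDate) as DriveTimeInMinutes   -- duration of one trip
from Drive;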
I have to find the first rest >= 36 hours, then:
Do
    Compute how many days there are
    SUM(DriveTime)
    SUM(TotalKM)
    until the next rest >= 36 hours
    If there is no more rest, EXIT DO
Loop
Everything from the beginning to the first rest is discarded, as is everything from the last rest to the end.
I have the data in an Excel sheet you can download from here: Download Excel with data example
I'm sorry for my English; I hope you can understand and help me. Thank you in advance.

There are several parts to the query. The first part pulls out the rows where Rest is >= 36 and assigns a row number. The result is stored in a CTE called BigRest.
with BigRest(RowNumber, DriverId, BeginDate, EndDate) as
(
    select ROW_NUMBER() over(partition by d.DriverId order by d.DriverId, d.BeginDate) RowNumber,
           d.DriverId, d.BeginDate, d.EndDate
    from Drive d
    where d.Rest >= 36
)
Then I assign the row number from BigRest to each row in Drive (which is what I'm calling the table that has all the data in it) based on the BeginDate. So the data is effectively segmented by the days where Rest >= 36. Each segment gets a number called DriveGroup.
;with Grouped(DriverId, BeginDate, EndDate, DriveTime, DriveKM, DriveGroup) as
(
    select d.DriverId, d.BeginDate, d.EndDate, d.DriveTime, d.DriveKM,
           (select top 1 b.RowNumber
            from BigRest b
            where b.DriverId = d.DriverId and b.BeginDate >= d.BeginDate
            order by b.BeginDate)
    from Drive d
)
Finally, I select the data from Grouped, cross applying it with some aggregate data from itself. We can filter out the rows where the DriveGroup is 1 or null because those represent the beginning and end rows that don't matter (the "do nothing" rows).
select distinct DriverId, MinBeginDate BeginDate, MaxEndDate EndDate,
       DATEDIFF(D, MinBeginDate, MaxEndDate) + 1 Days,
       DriveTimeSum Drive, DriveKMSum KM
from
(
    select g.DriverId, g.BeginDate, g.EndDate, g.DriveGroup, g.DriveTime,
           c.DriveTimeSum, c.DriveKMSum, c.MinBeginDate, c.MaxEndDate
    from Grouped g
    cross apply (select SUM(g2.DriveTime) DriveTimeSum,
                        SUM(g2.DriveKM) DriveKMSum,
                        MIN(g2.BeginDate) MinBeginDate,
                        MAX(g2.EndDate) MaxEndDate
                 from Grouped g2
                 where g2.DriverId = g.DriverId
                   and g2.DriveGroup = g.DriveGroup) as c
    where g.DriveGroup is not null
      and g.DriveGroup > 1
) x
Here's a SQL Fiddle
I'd encourage you to look at the results at each step of the query to see what's actually going on.
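As an aside, the same segmentation can be written more compactly with SQL Server 2012's windowed aggregates. This is only a hedged alternative sketch that reuses the Drive table and column names from above, not the method from the answer: a running count of the big rests labels each segment, and the whole-partition count lets us drop the rows after the last big rest.
;with Grouped as
(
    select d.DriverId, d.BeginDate, d.EndDate, d.DriveTime, d.DriveKM,
           -- running count of big rests up to and including this row;
           -- non-rest rows are shifted into the segment that ends at the next big rest
           sum(case when d.Rest >= 36 then 1 else 0 end)
               over (partition by d.DriverId order by d.BeginDate
                     rows between unbounded preceding and current row)
           + case when d.Rest >= 36 then 0 else 1 end as DriveGroup,
           -- total number of big rests for this driver
           sum(case when d.Rest >= 36 then 1 else 0 end)
               over (partition by d.DriverId) as BigRestCount
    from Drive d
)
select DriverId, min(BeginDate) BeginDate, max(EndDate) EndDate,
       DATEDIFF(D, min(BeginDate), max(EndDate)) + 1 Days,
       sum(DriveTime) Drive, sum(DriveKM) KM
from Grouped
where DriveGroup > 1                -- discard everything before the first big rest
  and DriveGroup <= BigRestCount   -- discard everything after the last big rest
group by DriverId, DriveGroup;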

Related

Show first and last value in table

I have an Excel file with a customer's purchasing details (sorted by date), for example:
customer_id | date   | $_Total_purchase
A           | 1/2/23 | 5
A           | 1/3/23 | 20
A           | 1/4/23 | 10
I want to show one row per customer, so the final table will be:
customer_id | date   | purchase_counter | amount_of_last_purchase | amount_of_first_purchase
A           | 1/4/23 | 3                | 10                      | 5
In my table, customer_id is a dimension.
For extracting the date, I use max(date) as a measure.
For purchase_counter, I use count(customer_id).
For extracting 'amount_of_first_purchase', I use FirstSortedValue('$_Total_purchase', date).
How can I extract 'amount_of_last_purchase'? Is there maybe an aggregation function I can use?
Thanks in advance :)
The simple answer is that you can use -date in your expression and this will return the last record:
FirstSortedValue('$_Total_purchase', -date)
The above will work for the provided data example. When there is more than one customer, the Aggr function can help:
First: FirstSortedValue(aggr(sum($_Total_purchase), customer_id, date), date)
Last: FirstSortedValue(aggr(sum($_Total_purchase), customer_id, date), -date)
Another approach (if it applies to your case/data) is to flag the first and last records during the data load and use the flags in the measures.
An example script:
RawData:
Load * Inline [
customer_id, date, $_Total_purchase
A, 2/1/23, 5
A, 3/1/23, 20
A, 4/1/23, 10
B, 5/1/23, 35
B, 6/1/23, 40
B, 7/1/23, 50
];
Temp0:
Load
customer_id,
date,
// flag the first record:
// if the current row is the beginning of the table, or the customer_id of the
// current row is different from the previously loaded customer_id,
// then flag it as isFirst = 1
if(RowNo() = 1 or customer_id <> peek(customer_id), 1, null()) as isFirst,
// getting the last record is a bit more tricky.
// similar logic: if the current and previous customer_id are different,
// or it is the end of the table, then take the current customer_id and date
// and combine their values, separated with |; ELSE write 0.
// for example: A|4/1/23 or B|7/1/23
if(customer_id <> peek(customer_id) and RowNo() <> 1, peek(customer_id) & '|' & peek(date),
if(RowNo() = NoOfRows('RawData'), customer_id & '|' & date, 0
)) as isLastTemp
Resident
RawData
;
// Get all the rows from Temp0 for which isLastTemp is not equal to 0,
// split isLastTemp by | -> the first value is customer_id and the second is date,
// and join the result back to the original table
join (RawData)
Load
SubField(isLastTemp, '|', 1) as customer_id,
SubField(isLastTemp, '|', 2) as date,
1 as isLast
Resident
Temp0
Where
isLastTemp <> 0
;
// join Temp0 to the original table
// but only grab the isFirst flag
join(RawData)
Load
customer_id,
date,
isFirst
Resident
Temp0
;
// this table is no longer needed
Drop Table Temp0;
Once the above script is reloaded, the RawData table will have two more columns - isFirst and isLast.
Then the expressions are simpler:
First: sum( {< isFirst = {1} >} $_Total_purchase)
Last: sum( {< isLast = {1} >} $_Total_purchase)
You can also do this with pandas:
import pandas as pd

# read the excel file
df = pd.read_excel('customer_purchases.xlsx')

# first row = first purchase (the data is sorted by date)
first_value = df.head(1)

# last row = last purchase
last_value = df.tail(1)

Postgres Function: how to return the first full set of data that occurs after specified date/time

I have a requirement to extract rows of data, but only if all said rows make a full set. We have a sequence table that is updated every minute, with data for 80 bins. We need to know the status of bins 1 thru 80 every minute as part of our production process.
I am generating a new report (a Postgres function) that needs to take a snapshot at roughly 00:01:00 AM (i.e. 1 minute past midnight). Initially I thought this would be an easy task: just grab the first 80 rows of data that occur at/after this time. However I see that, depending on network activity and industrial computer priorities, the table is not religiously updated at exactly 00:01:00 AM, or any minute for that matter. Updates can occur milliseconds or even seconds later, and take 500 ms to 800 ms to update the database. Sometimes a given minute can be missing altogether (production processes take precedence over data capture, but the sequence data is not super critical anyway).
My thinking is it would be more reliable to look for the first complete set of data anytime from 00:01:00AM onwards. So effectively, I have a table that looks a bit like this:
Apologies, I know you prefer that images of this kind not be pasted in this manner, but I could not figure out how to create a textual table like this here (the carriage return or Enter key is ignored!).
Basically, the above table is typical, but 1st minute is not guaranteed, and for that matter, I would not be 100% confident that all 80 bins are logged for a given minute. Hence my question: how to return the first complete set of data, where all 80 bins (rows) have been captured for a particular minute?
Thinking about it, I could do some sort of row count in the function, ensuring there are 80 rows for a given minute, but this seems less intuitive. I would like to know for sure that for each row of a given minute, bin 1 is represented, bin 2, bin 3, and so on.
Ultimately a call to this function will supply a min/max date/time and that period of time will be checked for the first available minute with a full set of bins data.
I am reasonably sure this will involve a window function, as all rows have to be assessed prior to data extraction. I've used window functions a few times now, but I'm still a green newbie compared to others here, so help is appreciated.
My final code, thanks to help from klin:
StartTime = DATE_TRUNC('minute', tme1);
EndTime = DATE_TRUNC('day', tme1) + '23 hours'::interval;

SELECT "BinSequence".*
FROM "BinSequence"
JOIN (
    SELECT "binMinute" AS binminute, count("binMinute")
    FROM "BinSequence"
    WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
    GROUP BY 1
    -- verifies that each and every bin is represented in the returned data
    HAVING COUNT(DISTINCT "binBinNo") = 80
) theseTuplesOnly ON theseTuplesOnly.binminute = "binMinute"
WHERE ("binTime" >= StartTime) AND ("binTime" < EndTime)
ORDER BY 1
LIMIT 80
Use the aggregate function count(*) grouping data by minutes (date_trunc('minute', datestamp) gives full minutes from datestamp), e.g.:
create table bins(datestamp time, bin int, param text);
insert into bins values
('00:01:10', 1, 'a'),
('00:01:20', 2, 'b'),
('00:01:30', 3, 'c'),
('00:01:40', 4, 'd'),
('00:02:10', 3, 'e'),
('00:03:10', 2, 'f'),
('00:03:10', 3, 'g'),
('00:03:10', 4, 'h');
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
order by 1
minute | count
----------+-------
00:01:00 | 4
00:02:00 | 1
00:03:00 | 3
(3 rows)
If you are not sure that all bins are unique in consecutive minutes, use distinct (this will make the query slower):
select date_trunc('minute', datestamp) as minute, count(distinct bin)
...
You cannot select counts aggregated by minute and all the columns of the table in a single simple select. If you want to do that, you should join a derived table, or use the IN operator, or use a window function. A join seems to be the simplest:
select b.*, count
from bins b
join (
select date_trunc('minute', datestamp) as minute, count(bin)
from bins
group by 1
having count(bin) = 4
) s
on date_trunc('minute', datestamp) = minute
order by 1;
datestamp | bin | param | count
-----------+-----+-------+-------
00:01:10 | 1 | a | 4
00:01:20 | 2 | b | 4
00:01:30 | 3 | c | 4
00:01:40 | 4 | d | 4
(4 rows)
Note also the use of HAVING to filter groups in the above query.
You can test the query here.

Is there any way to match multiple date ranges for inclusion in other multiple ranges in postgresql

For example, I have allowed ranges in the database - (08:00-12:00), (12:00-15:00) - and a requested range I want to test - (09:00-14:00). Is there any way to tell that my test range is included in the allowed ranges in the database? It can be split into even more parts; I just want to know if my range fully fits the list of time ranges in the database.
You don't provide the table structure, so I have no idea of the data type. Let's assume those are texts:
t=# select '(8:00, 12:30)' a,'(12:00, 15:00)' b,'(09:00, 14:00)' c;
a | b | c
---------------+----------------+----------------
(8:00, 12:30) | (12:00, 15:00) | (09:00, 14:00)
(1 row)
Then here's how you can do it:
t=# \x
Expanded display is on.
t=# with d(a,b,c) as (values('(8:00, 12:30)','(12:00, 15:00)','(09:00, 14:00)'))
, w as (select '2017-01-01 ' h)
, timerange as (
select
tsrange(concat(w.h,split_part(substr(a,2),',',1))::timestamp,concat(w.h,split_part(a,',',2))::timestamp) ta
, tsrange(concat(w.h,split_part(substr(b,2),',',1))::timestamp,concat(w.h,split_part(b,',',2))::timestamp) tb
, tsrange(concat(w.h,split_part(substr(c,2),',',1))::timestamp,concat(w.h,split_part(c,',',2))::timestamp) tc
from w
join d on true
)
select *, ta + tb glued, tc <@ ta + tb fits from timerange;
-[ RECORD 1 ]----------------------------------------
ta | ["2017-01-01 08:00:00","2017-01-01 12:30:00")
tb | ["2017-01-01 12:00:00","2017-01-01 15:00:00")
tc | ["2017-01-01 09:00:00","2017-01-01 14:00:00")
glued | ["2017-01-01 08:00:00","2017-01-01 15:00:00")
fits | t
First you need to "cast" your times to timestamps, as there is no time range type in Postgres, so we take the same day for all times (w.h = 2017-01-01) and convert a, b, c to ta, tb, tc with the default inclusion brackets (which totally fits our case).
Then use the union operator (+) to get the "glued" interval: https://www.postgresql.org/docs/current/static/functions-range.html#RANGE-FUNCTIONS-TABLE
Lastly, check whether the range is contained by the larger one with the <@ operator.
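Here's the same containment check with literal ranges, using the values from the example above:
select tsrange('2017-01-01 09:00', '2017-01-01 14:00')
       <@ (  tsrange('2017-01-01 08:00', '2017-01-01 12:30')
           + tsrange('2017-01-01 12:00', '2017-01-01 15:00')) as fits;
-- fits = t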

Postgres query including time range

I have a query that pulls part of the data that I need for tracking. What I need to add is either a column that includes the date, or the ability to query the table for a date range. I would prefer the column if possible. I am using PostgreSQL 8.3.3.
Here is the current query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name;
This returns the following information:
mailing_count | org
---------------+-----------------------------------------
2 | org1 name
8 | org2 name
22 | org3 name
21 | org4 name
39 | org5 name
The table I am querying has 3 timestamp columns: target_launch_date, created_time and modified_time.
When I try to add the date range to the query I get an error:
Query:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing, department
where mailing.department_id = department.department_id
group by department.name,
WHERE (target_launch_date)>= 2016-09-01 AND < 2016-09-05;
Error:
ERROR: syntax error at or near "WHERE" LINE 1:
...department.department_id group by department.name,WHERE(targ...
I've tried moving the location of the date range in the string and a variety of other changes, but cannot achieve what I am looking for.
Any insight would be greatly appreciated!
Here's a query that would do what you need:
SELECT
count(m.mailing_id) as mailing_count,
d.name as org
FROM mailing m
JOIN department d USING( department_id )
WHERE
m.target_launch_date BETWEEN '2016-09-01' AND '2016-09-05'
GROUP BY 2
Since your target_launch_date is of type timestamp, you can safely do <= '2016-09-05': the literal converts to 2016-09-05 00:00:00.00000, so you get all the dates before the start of that day, plus exactly 2016-09-05 00:00:00.00000.
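You can check that conversion directly:
select '2016-09-05'::timestamp;
-- 2016-09-05 00:00:00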
A couple of additional notes:
Use aliases for table names to shorten the code, eg. mailing m
Use explicit JOIN syntax to connect data from related tables
Apply your WHERE clause before GROUP BY to exclude rows that don't match it
Use BETWEEN operator to handle date >= X AND date <= Y case
You can use USING instead of ON in JOIN syntax when joined column names are the same
You can use column numbers in GROUP BY which point to position of a column in your select
To gain more insight on the matter of how processing of a SELECT statement behaves in steps look at the documentation.
Edit
The approach using the BETWEEN operator would include 2016-09-05 00:00:00.00000 in the result set. If this timestamp should be discarded, change BETWEEN x AND y to either of these two:
(...) BETWEEN x AND y::timestamp - INTERVAL '1 microsecond'
(...) >= x AND (...) < y
You were close: you need to supply the column name in the second part of the condition too, and you would have a single WHERE clause:
select count(mailing.mailing_id) as mailing_count, department.name as org
from mailing
inner join department on mailing.department_id = department.department_id
where target_launch_date >= '2016-09-01 00:00:00'
AND target_launch_date < '2016-09-05 00:00:00'
group by department.name;
EDIT: This part is just for Kamil G., showing clearly that BETWEEN should NOT be used:
create table sample (id int, d timestamp);
insert into sample (id, d)
values
(1, '2016/09/01'),
(2, '2016/09/02'),
(3, '2016/09/03'),
(4, '2016/09/04'),
(5, '2016/09/05'),
(6, '2016/09/05 00:00:00'),
(7, '2016/09/05 00:00:01'),
(8, '2016/09/06');
select * from sample where d between '2016-09-01' and '2016-09-05';
Result:
1;"2016-09-01 00:00:00"
2;"2016-09-02 00:00:00"
3;"2016-09-03 00:00:00"
4;"2016-09-04 00:00:00"
5;"2016-09-05 00:00:00"
6;"2016-09-05 00:00:00"
BTW, if you won't believe it without seeing EXPLAIN, here it is:
Filter: ((d >= '2016-09-01 00:00:00'::timestamp without time zone) AND
(d <= '2016-09-05 00:00:00'::timestamp without time zone))

Tableau - Calculating average where date is less than value from another data source

I am trying to calculate the average of a column in Tableau, except the problem is that I am trying to use a single date value (based on a filter) from another data source to only calculate the average where the exam date is <= the filtered date value from the other source.
Note: parameters will not work for me here, since new date values are constantly being added to the set.
I have tried many different approaches, but the simplest was trying to use a calculated field that pulls in the filtered exam date from the other data source.
It successfully pulls the filtered date, but the formula does not work as expected. Two versions of the calculation are below:
IF DATE(ATTR([Exam Date])) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN AVG([Raw Score]) END
IF DATEDIFF('day', DATE(ATTR([Exam Date])), DATE(ATTR([Averages (Tableau Test Scores)].[Updated]))) > 1 THEN AVG([Raw Score]) END
Basically, I am looking for the equivalent of this in SQL Server:
SELECT AVG([Raw Score]) WHERE ExamDate <= (Filtered Exam Date)
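For reference, a runnable sketch of that pseudo-SQL (the table and column names here are assumptions, borrowed from the custom SQL in the solution below):
SELECT AVG(s.RawScorePercent) AS MeanScore
FROM StudentScore s
WHERE s.ExamDate <= (SELECT MAX(u.Updated) FROM Average_Update u);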
Below is a workbook that shows an example of what I am trying to accomplish. Currently it returns all blanks, likely due to the many-to-one comparison I am trying to use in my calculation.
Any feedback is greatly appreciated!
Tableau Test Exam Workbook
I was able to solve this by using Custom SQL to join the tables together and calculate the average based on my conditions, to get the column results I wanted.
Would still be great to have this ability directly in Tableau, but whatever gets the job done.
Edit:
SELECT
[AcademicYear]
,[Discipline]
--Get the number of student takers
,COUNT([Id]) AS [Students (N)]
--Get the average of the Raw Score
,CAST(AVG(RawScore) AS DECIMAL(10,2)) AS [School Mean]
--Get the number of failures based on an "adjusted score" column
,COUNT(CASE WHEN [AdjustedScore] < 70 THEN 1 END) AS [School Failures]
--This is the column used as the cutoff point for including scores
,[Average_Update].[Updated]
FROM [dbo].[Average] [Average]
FULL OUTER JOIN [dbo].[Average_Update] [Average_Update] ON ([Average_Update].[Id] = [Average].UpdateDateId)
--The meat of joining data for accurate calculations
FULL OUTER JOIN (
SELECT DISTINCT S.[Id], S.[LastName], S.[FirstName], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[Subject], P.[Id] AS PeriodId
FROM [StudentScore] S
FULL OUTER JOIN
(
--Get only the 1st attempt
SELECT DISTINCT [NBOMEId], S2.[Subject], MIN([ExamDate]) AS ExamDate
FROM [StudentScore] S2
GROUP BY [NBOMEId],S2.[Subject]
) B
ON S.[NBOMEId] = B.[NBOMEId] AND S.[Subject] = B.[Subject] AND S.[ExamDate] = B.[ExamDate]
--Group in "Exam Periods" based on the list of periods w/ start & end dates in another table.
FULL OUTER JOIN [ExamPeriod] P
ON S.[ExamDate] >= P.PeriodStart AND S.[ExamDate] <= P.PeriodEnd
WHERE S.[Subject] = B.[Subject]
GROUP BY P.[Id], S.[Subject], S.[ExamDate], S.[RawScoreStandard], S.[RawScorePercent], S.[AdjustedScore], S.[NBOMEId], S.[NBOMELastName], S.[NBOMEFirstName], S.[SecondYrTake]) [StudentScore]
ON
([StudentScore].PeriodId = [Average_Update].ExamPeriodId
AND [StudentScore].Subject = [Average].Subject
AND [StudentScore].[ExamDate] <= [Average_Update].[Updated])
--End meat
--Joins to pull in relevant data for normalized tables
FULL OUTER JOIN [dbo].[Student] [Student] ON ([StudentScore].[NBOMEId] = [Student].[NBOMEId])
INNER JOIN [dbo].[ExamPeriod] [ExamPeriod] ON ([Average_Update].ExamPeriodId = [ExamPeriod].[Id])
INNER JOIN [dbo].[AcademicYear] [AcademicYear] ON ([ExamPeriod].[AcademicYearId] = [AcademicYear].[Id])
--This will pull only the latest update entry for every academic year.
WHERE [Updated] IN (
SELECT DISTINCT MAX([Updated]) AS MaxDate
FROM [Average_Update]
GROUP BY[ExamPeriodId])
GROUP BY [AcademicYear].[AcademicYearText], [Average].[Subject], [Average_Update].[Updated]
ORDER BY [AcademicYear].[AcademicYearText], [Average_Update].[Updated], [Average].[Subject]
I couldn't download your file to test with your data, but try reversing the order of the IF and the average, i.e.
AVG(IF DATE([Exam Date]) <= DATE(ATTR([Averages (Tableau Test Scores)].[Updated])) THEN [Raw Score] END)
As written, I believe you'll be averaging the data before returning it from the IF statement, whereas you want to return the data and then average it.