T-SQL vlookup with fake calendar table? - tsql
I am rather new in T-SQL and I have to create a view, where the output will be as shown below:
[Image: expected output]
But my sales table doesn't have any data about sales in February and May for customer ABC and no data in January for customer XYZ, but I really want to have 0 for these months. How to do it in T-SQL?
This is a great question about a very important topic that even many experienced developers need to brush up on. Since you are rather new to T-SQL, I won't just offer a solution; I'll explain the key concepts involved.
The Auxiliary Table Numbers
First, let's learn what a tally table, also known as a numbers table, is all about.
What does this do?
SELECT N = 1 ;
It returns the number 1.
N
-----
1
How about this?
SELECT N = 1 FROM (VALUES(0)) AS e(N);
Same thing:
N
-----
1
What does this return?
SELECT N = 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n);
Here I'm leveraging the VALUES table constructor, which allows a list of values to be treated like a table. The query returns one row for each of the five values in the list:
N
-------
1
1
1
1
1
We don't need the ones, we need the rows. This will make more sense in a moment. Now, what does this do?
WITH e(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n))
SELECT N = 1 FROM e e1;
It returns the same thing, five 1's, but I've wrapped the code in a CTE named e. Think of a CTE as an inline view, scoped to a single statement, that you can reference multiple times. Now let's CROSS JOIN e to itself. This returns 25 dummy rows (5*5).
WITH e(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n))
SELECT N = 1 FROM e e1, e e2;
Next we leverage ROW_NUMBER() over our set of dummy values.
WITH E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0)) AS e(n))
SELECT N = ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a;
Returns (truncated for brevity):
N
--------------------
1
2
3
...
24
25
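The same pattern scales by cross joining more copies of the base CTE. The following is a minimal sketch, not part of the original walkthrough: it uses a 10-row base and three copies for 1,000 rows, and the TOP (365) limit is an arbitrary value chosen for illustration.

```sql
WITH
E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)), -- 10 dummy rows
iTally(N) AS
(
  SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
  FROM E1 a, E1 b, E1 c   -- 10 * 10 * 10 = 1,000 dummy rows
)
SELECT TOP (365) N   -- keep only as many numbers as needed (365 is illustrative)
FROM iTally
ORDER BY N;
```

Each extra copy of E1 in the FROM clause multiplies the row count by ten, so the base list of values can stay short.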
Using as an auxiliary numbers table
@OneToTen is a table variable holding some numbers between 1 and 10. I need to count how many times each number appears, returning 0 for numbers that don't appear at all. NOTE MY COMMENTS:
;--== 2. Simple Use Case - Counting all numbers, including missing ones (missing = 0)
DECLARE @OneToTen TABLE (N INT);
INSERT @OneToTen VALUES(1),(2),(2),(2),(4),(8),(8),(10),(10),(10);
WITH E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)),
iTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a)
SELECT
N = i.N,
Wrong = COUNT(*), -- WRONG!!! Don't do THIS, this counts ALL rows returned
Correct = COUNT(t.N) -- Correct, this counts numbers from @OneToTen AKA "t.N"
FROM iTally AS i -- Aux Table of numbers
LEFT JOIN @OneToTen AS t -- Table to evaluate
ON i.N = t.N -- LEFT JOIN @OneToTen numbers to our Aux table of numbers
WHERE i.N <= 10 -- We only need the numbers 1 to 10
GROUP BY i.N; -- Group by with no Sort!!!
This returns:
N Wrong Correct
----- ----------- -----------
1 1 1
2 3 3
3 1 0
4 1 1
5 1 0
6 1 0
7 1 0
8 2 2
9 1 0
10 3 3
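Note that I show you both the wrong and the right way to do this. COUNT(*) is wrong here because the LEFT JOIN still returns one row for every missing number, and COUNT(*) counts that row. You need COUNT(whatever you are counting), which ignores NULLs.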
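To see the difference in isolation, here is a minimal sketch with a hypothetical table variable @t that holds only the value 1, joined to a two-row list of numbers. The names @t, v, CountStar and CountOfTN are made up for this illustration.

```sql
-- Hypothetical example: the value 2 has no match in @t
DECLARE @t TABLE (N INT);
INSERT @t VALUES (1);

SELECT
  N         = v.N,
  CountStar = COUNT(*),    -- counts the outer-joined row even when t.N is NULL
  CountOfTN = COUNT(t.N)   -- skips NULLs, so an unmatched number counts as 0
FROM (VALUES (1),(2)) AS v(N)
LEFT JOIN @t AS t
  ON t.N = v.N
GROUP BY v.N;
```

For N = 2 the LEFT JOIN produces a single row whose t.N is NULL, so CountStar is 1 while CountOfTN is 0.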
Auxiliary table of Dates (AKA calendar table)
Now we use our numbers table to create a calendar table.
;--== 3. Auxiliary Month/Year Calendar Table
DECLARE @Start DATE = '20191001',
@End DATE = '20200301';
WITH E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)),
iTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a)
SELECT TOP(DATEDIFF(MONTH,@Start,@End)+1)
TheDate = f.Dt,
TheYear = YEAR(f.Dt),
TheMonth = MONTH(f.Dt),
TheWeekday = DATEPART(WEEKDAY,f.Dt),
DayOfTheYear = DATEPART(DAYOFYEAR,f.Dt),
LastDayOfMonth = EOMONTH(f.Dt)
FROM iTally AS i
CROSS APPLY (VALUES(DATEADD(MONTH, i.N-1, @Start))) AS f(Dt);
This returns:
TheDate TheYear TheMonth TheWeekday DayOfTheYear LastDayOfMonth
---------- ----------- ----------- ----------- ------------ --------------
2019-10-01 2019 10 3 274 2019-10-31
2019-11-01 2019 11 6 305 2019-11-30
2019-12-01 2019 12 1 335 2019-12-31
2020-01-01 2020 1 4 1 2020-01-31
2020-02-01 2020 2 7 32 2020-02-29
2020-03-01 2020 3 1 61 2020-03-31
You will only need the YEAR and MONTH.
The Auxiliary Customer table
Because you are performing aggregations (SUM,COUNT,etc.) against multiple customers we will also need an Auxiliary table of customers, more commonly known as a lookup or dimension.
SAMPLE DATA:
;--== Sample Data
DECLARE @sale TABLE
(
Customer VARCHAR(10),
SaleYear INT,
SaleMonth TINYINT,
SaleAmt DECIMAL(19,2),
INDEX idx_cust(Customer)
);
INSERT @sale
VALUES('ABC',2019,12,410),('ABC',2020,1,668),('ABC',2020,1,50), ('ABC',2020,3,250),
('CDF',2019,10,200),('CDF',2019,11,198),('CDF',2020,1,333),('CDF',2020,2,5000),
('CDF',2020,2,325),('CDF',2020,3,1105),('FRED',2018,11,1105);
Distinct list of customers for an "Auxiliary Table of Customers":
SELECT DISTINCT s.Customer FROM @sale AS s;
For my sample data we get:
Customer
----------
ABC
CDF
FRED
Putting it all together
Here I'm going to:
Create a numbers table
Use my numbers table to create a calendar table
Create an auxiliary Customer table from #sale
CROSS JOIN (combine) both tables for a "junk dimension"
LEFT JOIN our sales data to our calendar/customer auxiliary tables/junk dimension
Group by the auxiliary table values
SOLUTION:
;--==== SAMPLE DATA
DECLARE @sale TABLE
(
Customer VARCHAR(10),
SaleYear INT,
SaleMonth TINYINT,
SaleAmt DECIMAL(19,2),
INDEX idx_cust(Customer)
);
INSERT @sale
VALUES('ABC',2019,12,410),('ABC',2020,1,668),('ABC',2020,1,50), ('ABC',2020,3,250),
('CDF',2019,10,200),('CDF',2019,11,198),('CDF',2020,1,333),('CDF',2020,2,5000),
('CDF',2020,2,325),('CDF',2020,3,1105),('FRED',2018,11,1105);
;--==== START/END DATEs
DECLARE @Start DATE = '20191001',
@End DATE = '20200301';
;--==== FINAL SOLUTION
WITH -- 6.1. Auxiliary Table of numbers:
E1(N) AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)),
iTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY(SELECT NULL)) FROM E1, E1 a),
-- 6.2. Use numbers table to create an "Auxiliary Date Table" (Calendar Table):
MonthYear(SaleYear,SaleMonth) AS
(
SELECT TOP(DATEDIFF(MONTH,@Start,@End)+1) YEAR(f.Dt), MONTH(f.Dt)
FROM iTally AS i
CROSS APPLY (VALUES(DATEADD(MONTH, i.N-1, @Start))) AS f(Dt)
)
SELECT
Customer = cust.Customer,
MonthYear = CONCAT(cal.SaleYear,'-',cal.SaleMonth),
Sales = ISNULL(SUM(s.SaleAmt),0)
-- Auxiliary Table of Customers
FROM (SELECT DISTINCT s.Customer FROM @sale AS s) AS cust -- 6.3. Aux Customer Table
CROSS JOIN MonthYear AS cal -- 6.4. Cross join to create Calendar/Customer Junk Dimension
LEFT JOIN @sale AS s -- 6.5. Join @sale to Junk Dimension on Year,Month and Customer
ON s.SaleYear = cal.SaleYear
AND s.SaleMonth = cal.SaleMonth
AND s.Customer = cust.Customer
GROUP BY cust.Customer, cal.SaleYear, cal.SaleMonth -- 6.6. Group by Junk Dim values
ORDER BY cust.Customer, cal.SaleYear, cal.SaleMonth; -- Order by not required
RESULTS:
Customer MonthYear Sales
---------- ------------ ------------
ABC 2019-10 0.00
ABC 2019-11 0.00
ABC 2019-12 410.00
ABC 2020-1 718.00
ABC 2020-2 0.00
ABC 2020-3 250.00
CDF 2019-10 200.00
CDF 2019-11 198.00
CDF 2019-12 0.00
CDF 2020-1 333.00
CDF 2020-2 5325.00
CDF 2020-3 1105.00
FRED 2019-10 0.00
FRED 2019-11 0.00
FRED 2019-12 0.00
FRED 2020-1 0.00
FRED 2020-2 0.00
FRED 2020-3 0.00
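Since the question asks for a view, and a view cannot contain DECLARE statements or table variables, the solution needs a small adaptation. The following is only a sketch under stated assumptions: dbo.Sale is a hypothetical permanent table shaped like the sample data, dbo.vw_MonthlySales is a made-up view name, and the month range is derived from the earliest and latest sale in the table instead of from start/end variables. The 100-row tally limits it to 100 months.

```sql
-- Assumption: dbo.Sale is a hypothetical permanent table shaped like the sample data
CREATE TABLE dbo.Sale
(
  Customer  VARCHAR(10),
  SaleYear  INT,
  SaleMonth TINYINT,
  SaleAmt   DECIMAL(19,2)
);
GO
CREATE VIEW dbo.vw_MonthlySales   -- hypothetical view name
AS
WITH
E1(N)     AS (SELECT 1 FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS e(n)),
iTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E1, E1 a), -- 100 rows
Bounds(StartDt, EndDt) AS          -- first and last month that has any sale
(
  SELECT MIN(DATEFROMPARTS(s.SaleYear, s.SaleMonth, 1)),
         MAX(DATEFROMPARTS(s.SaleYear, s.SaleMonth, 1))
  FROM dbo.Sale AS s
),
MonthYear(SaleYear, SaleMonth) AS  -- one row per month between the bounds
(
  SELECT YEAR(f.Dt), MONTH(f.Dt)
  FROM Bounds AS b
  CROSS JOIN iTally AS i
  CROSS APPLY (VALUES(DATEADD(MONTH, i.N-1, b.StartDt))) AS f(Dt)
  WHERE i.N <= DATEDIFF(MONTH, b.StartDt, b.EndDt) + 1
)
SELECT
  Customer  = cust.Customer,
  SaleYear  = cal.SaleYear,
  SaleMonth = cal.SaleMonth,
  Sales     = ISNULL(SUM(s.SaleAmt), 0)
FROM (SELECT DISTINCT s.Customer FROM dbo.Sale AS s) AS cust
CROSS JOIN MonthYear AS cal
LEFT JOIN dbo.Sale AS s
  ON  s.SaleYear  = cal.SaleYear
  AND s.SaleMonth = cal.SaleMonth
  AND s.Customer  = cust.Customer
GROUP BY cust.Customer, cal.SaleYear, cal.SaleMonth;
GO
-- Usage: sorting belongs in the query that reads the view
SELECT * FROM dbo.vw_MonthlySales ORDER BY Customer, SaleYear, SaleMonth;
```

GO is the batch separator used by SSMS and sqlcmd; CREATE VIEW has to be the first statement in its batch. If a fixed reporting window is needed, filtering on SaleYear and SaleMonth in the query that reads the view is one option.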
Related
Left Join With COUNT()
I have the following query that is giving me issues on the second JOIN/COUNT for the StatsStrategySessions table:
SELECT fa.Id
, CAST(fa.StatDate AS DATE)
, COUNT(sa.CreatedDateTime) AS 'TotalApplications'
, COUNT(ss.CreatedDateTime) AS 'TotalStrategySessions'
FROM StatsFacebookAds fa
LEFT JOIN StatsApplications sa
ON CAST(fa.StatDate AS DATE) = CAST(sa.CreatedDateTime AS DATE)
AND sa.LeadSourceId = 1
LEFT JOIN StatsStrategySessions ss
ON CAST(fa.StatDate AS DATE) = CAST(ss.CreatedDateTime AS DATE)
AND ss.LeadSourceId = 1
GROUP BY fa.Id
, fa.StatDate
It returns twice the amount that it should:
Id                     TotalApplications TotalStrategySessions
----------- ---------- ----------------- ---------------------
1           2019-12-02 1                 1
2           2019-12-03 0                 0
3           2019-12-04 0                 0
4           2019-12-05 4                 4
With the second JOIN/COUNT, the count doubles to 4 instead of the 2 it should be.
When I run the code without the second JOIN/COUNT, it works as I would expect:
SELECT fa.Id
, CAST(fa.StatDate AS DATE)
, COUNT(sa.CreatedDateTime) AS 'TotalApplications'
FROM StatsFacebookAds fa
LEFT JOIN StatsApplications sa
ON CAST(fa.StatDate AS DATE) = CAST(sa.CreatedDateTime AS DATE)
AND sa.LeadSourceId = 1
GROUP BY fa.Id
, fa.StatDate
It returns what I expect it to:
Id                     TotalApplications
----------- ---------- -----------------
1           2019-12-02 1
2           2019-12-03 0
3           2019-12-04 0
4           2019-12-05 2
But as soon as I join the second table, the numbers are not what I'm trying to display. It's been a while since I wrote T-SQL, so hopefully it's something I spaced on. Thanks for the assistance!
Try using a distinct count on the StatsApplications table:
SELECT
fa.Id,
CAST(fa.StatDate AS DATE),
COUNT(DISTINCT sa.CreatedDateTime) AS [TotalApplications], -- change is here
COUNT(ss.CreatedDateTime) AS [TotalStrategySessions]
FROM StatsFacebookAds fa
LEFT JOIN StatsApplications sa
ON CAST(fa.StatDate AS DATE) = CAST(sa.CreatedDateTime AS DATE) AND
sa.LeadSourceId = 1
LEFT JOIN StatsStrategySessions ss
ON CAST(fa.StatDate AS DATE) = CAST(ss.CreatedDateTime AS DATE) AND
ss.LeadSourceId = 1
GROUP BY
fa.Id,
fa.StatDate;
The concept here is that the second additional join to the StatsStrategySessions table runs the risk of duplicating all the records in the StatsApplications table, since each latter record might join to multiple records in the former. But, by taking the distinct count, we can remove the double counting.
If you don't like this approach, then another way to handle this would be to join to a subquery on the two tables which finds the counts separately, and then join back to StatsFacebookAds.
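The subquery approach can be sketched as follows. This is an illustration rather than the answerer's own code: the three table variables and their rows are hypothetical stand-ins for the tables in the question, and the aliases Cnt and CreatedDate are made up. Each child table is aggregated to one row per date before it is joined, so neither join can multiply the other's rows.

```sql
-- Hypothetical sample data standing in for the three tables in the question
DECLARE @StatsFacebookAds      TABLE (Id INT, StatDate DATETIME);
DECLARE @StatsApplications     TABLE (CreatedDateTime DATETIME, LeadSourceId INT);
DECLARE @StatsStrategySessions TABLE (CreatedDateTime DATETIME, LeadSourceId INT);

INSERT @StatsFacebookAds      VALUES (1,'20191202'),(2,'20191203'),(3,'20191204'),(4,'20191205');
INSERT @StatsApplications     VALUES ('20191202 09:00',1),('20191205 10:00',1),('20191205 11:00',1);
INSERT @StatsStrategySessions VALUES ('20191202 12:00',1),('20191205 13:00',1),('20191205 14:00',1);

SELECT
  fa.Id,
  StatDate              = CAST(fa.StatDate AS DATE),
  TotalApplications     = ISNULL(sa.Cnt, 0),   -- no matching date means 0
  TotalStrategySessions = ISNULL(ss.Cnt, 0)
FROM @StatsFacebookAds AS fa
LEFT JOIN
(
  -- one row per date, counted before joining
  SELECT CreatedDate = CAST(CreatedDateTime AS DATE), Cnt = COUNT(*)
  FROM @StatsApplications
  WHERE LeadSourceId = 1
  GROUP BY CAST(CreatedDateTime AS DATE)
) AS sa
  ON sa.CreatedDate = CAST(fa.StatDate AS DATE)
LEFT JOIN
(
  SELECT CreatedDate = CAST(CreatedDateTime AS DATE), Cnt = COUNT(*)
  FROM @StatsStrategySessions
  WHERE LeadSourceId = 1
  GROUP BY CAST(CreatedDateTime AS DATE)
) AS ss
  ON ss.CreatedDate = CAST(fa.StatDate AS DATE)
ORDER BY fa.Id;
```

With this sample data, Id 4 comes back with 2 applications and 2 strategy sessions instead of 4 and 4. Unlike COUNT(DISTINCT ...), this also counts correctly when two rows share the same CreatedDateTime.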
Kdb q query data from one table based on the data from another table without join
I'm new to kdb/q, and the following is my question. I really hope someone who is an expert in kdb can help me out.
I have two tables. Table t1 has two attributes, tp_time and id, and looks like:
tp_time                  id
------------------------------
2018.06.25T00:07:15.822  1
2018.06.25T00:07:45.823  3
2018.06.25T00:09:01.963  8
... ...
Table t2 has three attributes: tp_time, id, and price. For each id, it has lots of prices at different tp_time values, so table t2 is really large. It looks like the following:
tp_time                  id  price
----------------------------------------
2018.06.25T00:05:99.999  1   10.87
2018.06.25T00:06:05.823  1   10.88
2018.06.25T00:06:18.999  1   10.88
... ...
2018.06.25T17:39:20.999  1   10.99
2018.06.25T17:39:23.999  1   10.99
2018.06.25T17:39:24.999  1   10.99
... ...
2018.06.25T01:39:39.999  2   10.99
2018.06.25T01:39:41.999  2   10.99
2018.06.25T01:39:45.999  2   10.99
... ...
What I am trying to do is, for each row in table t1, find its price at the nearest time and its price approximately 5 seconds later. For example, for the first row in table t1:
2018.06.25T00:07:15.822  1
the price at the nearest time is 10.87 and the price around 5 seconds later is 10.88. My expected output table looks like the following:
tp_time                  id  price_1     price_2
----------------------------------------------------
2018.06.25T00:07:15.822  1   10.87       10.88
2018.06.25T00:07:45.823  3   SOME_PRICE  SOME_PRICE
2018.06.25T00:09:01.963  8   SOME_PRICE  SOME_PRICE
... ...
The thing is, I cannot join t1 and t2, because table t2 is so large that I would kill the server. I've tried something like ...where tp_time within(time1, time2), but I'm not sure how to deal with the time1 and time2 variables.
Could someone give me some help with this question? Thanks so much!
I'll recommend organizing the table t1 by applying the proper attributes so that when you join the tables, it will generate the results quickly.
Since you are looking for the prevailing price and the price after 5 seconds, you will need wj for this. The general syntax is:
wj[w;c;t;(q;(f0;c0);(f1;c1))]
w - begin and end time
t & q - unkeyed tables; q should be sorted by `id`time with `p# on id
c - names of the columns to be joined
f0,f1 - aggregation functions
In your case t2 should be sorted by `id`time with `p# on id
q)t2:update `g#id from `id`tp_time xasc ([] tp_time:`time$10:20:30 + asc -10?10 ; id:10?3 ;price:10?10.)
q)t1:([] tp_time:`time$10:20:30 + asc -3?5 ; id:1 1 1 )
q)select from t2 where id=1
tp_time      id price
10:20:31.000 1  4.410662
10:20:32.000 1  5.473385
10:20:38.000 1  1.247049
q)wj[(`second$0 5)+\:t1.tp_time;`id`tp_time;t1;(t2;(first;`price);(last;`price))]
tp_time      id price    price
10:20:30.000 1  4.410662 5.473385
10:20:31.000 1  4.410662 5.473385
10:20:34.000 1  5.473385 1.247049 //price at 32nd second & 38th second
T-SQL - Data Islands and Gaps - How do I summarise transactional data by month?
I'm trying to query some transactional data to establish the CurrentProductionHours value for each Report at the end of each month.
Providing there has been a transaction for each report in each month, that's pretty straightforward. I can use something along the lines of the code below to partition transactions by month and then pick out the rows where TransactionByMonth = 1 (effectively, the last transaction for each report each month).
SELECT
ReportId,
TransactionId,
CurrentProductionHours,
ROW_NUMBER() OVER (PARTITION BY [ReportId], [CalendarYear], [MonthOfYear]
ORDER BY TransactionTimestamp desc
) AS TransactionByMonth
FROM
tblSource
The problem that I have is that there will not necessarily be a transaction for every report every month. When that's the case, I need to carry forward the last known CurrentProductionHours value to the month which has no transaction, as this indicates that there has been no change. Potentially, this value may need to be carried forward multiple times.
Source Data:
ReportId  TransactionTimestamp  CurrentProductionHours
1         2014-01-05 13:37:00   14.50
1         2014-01-20 09:15:00   15.00
1         2014-01-21 10:20:00   10.00
2         2014-01-22 09:43:00   22.00
1         2014-02-02 08:50:00   12.00
Target Results:
ReportId  Month  Year  ProductionHours
1         1      2014  10.00
2         1      2014  22.00
1         2      2014  12.00
2         2      2014  22.00
I should also mention that I have a date table available, which can be referenced if required.
** UPDATE 05/03/2014 **
I now have a query which is generating results as shown in the example below, but I'm left with islands of data (where a transaction existed in that month) and gaps in between. My question is still similar but in some ways a little more generic: what is the best way to fill gaps between data islands if you have the dataset below as a starting point?
ReportId  Month  Year  ProductionHours
1         1      2014  10.00
1         2      2014  12.00
1         3      2014  NULL
2         1      2014  22.00
2         2      2014  NULL
2         3      2014  NULL
Any advice about how to tackle this would be greatly appreciated!
Try this:
;with a as
(
select dateadd(m, datediff(m, 0, min(TransactionTimestamp))+1,0) minTransactionTimestamp,
max(TransactionTimestamp) maxTransactionTimestamp
from tblSource
), b as
(
select minTransactionTimestamp TT, maxTransactionTimestamp
from a
union all
select dateadd(m, 1, TT), maxTransactionTimestamp
from b
where tt < maxTransactionTimestamp
), c as
(
select distinct t.ReportId, b.TT
from tblSource t
cross apply b
)
select c.ReportId,
month(dateadd(m, -1, c.TT)) Month,
year(dateadd(m, -1, c.TT)) Year,
x.CurrentProductionHours
from c
cross apply
(select top 1 CurrentProductionHours
from tblSource
where TransactionTimestamp < c.TT
and ReportId = c.ReportId
order by TransactionTimestamp desc) x
A similar approach, but using a cartesian product to obtain all the combinations of report ids/months in the first step. A second step adds to that cartesian product the maximum timestamp from the source table where the month is less than or equal to the month in the current row. Finally, it joins the source table to the temp table by report id/timestamp to obtain the latest source table row for every report id/month.
;
WITH allcombinations -- Cartesian (reportid X yearmonth)
AS ( SELECT reportid ,
yearmonth
FROM ( SELECT DISTINCT
reportid
FROM tblSource
) a
JOIN ( SELECT DISTINCT
DATEPART(yy, transactionTimestamp)
* 100 + DATEPART(MM,
transactionTimestamp) yearmonth
FROM tblSource
) b ON 1 = 1
),
maxdates --add correlated max timestamp where the month is less or equal to the month in current record
AS ( SELECT a.* ,
( SELECT MAX(transactionTimestamp)
FROM tblSource t
WHERE t.reportid = a.reportid
AND DATEPART(yy, t.transactionTimestamp)
* 100 + DATEPART(MM,
t.transactionTimestamp) <= a.yearmonth
) maxtstamp
FROM allcombinations a
)
-- join previous data to the source table by reportid and timestamp
SELECT distinct m.reportid ,
m.yearmonth ,
t.CurrentProductionHours
FROM maxdates m
JOIN tblSource t ON t.transactionTimestamp = m.maxtstamp and t.reportid=m.reportid
ORDER BY m.reportid ,
m.yearmonth
TSQL Join to get all records from table A for each record in table B?
I have two tables:
Periods Table:
PeriodId  Period
--------  -------
1         Week 1
2         Week 2
3         Week 3
Worked Table:
EmpId  PeriodId  ApprovedDate
-----  --------  ------------
1      1         Null
1      2         2/28/2013
2      2         2/28/2013
I am trying to write a query that results in this:
EmpId  Period  Worked  ApprovedDate
-----  ------  ------  ------------
1      Week 1  Yes     Null
1      Week 2  Yes     2/28/2013
1      Week 3  No      Null
2      Week 1  No      Null
2      Week 2  Yes     2/28/2013
2      Week 3  No      Null
The idea is that I need each Period from the Periods table for each Emp. If there was no record in the Worked table, then the 'No' value is placed in the Worked field.
What does the TSQL look like to get this result?
(Note: if it helps, I also have access to an Employee table that has EmpId and LastName for each employee. For performance reasons I'm hoping not to need this, but if I do then so be it.)
You should be able to use the following:
select p.empid,
p.period,
case
when w.PeriodId is not null
then 'Yes'
else 'No' End Worked,
w.ApprovedDate
from
(
select p.periodid, p.period, e.empid
from periods p
cross join
(
select distinct EmpId
from worked
) e
) p
left join worked w
on p.periodid = w.periodid
and p.empid = w.empid
order by p.empid
See SQL Fiddle with Demo
tsql PIVOT function
Need help with the following query.
Current data format:
StudentID  EnrolledStartTime   EnrolledEndTime
1          7/18/2011 1.00 AM   7/18/2011 1.05 AM
2          7/18/2011 1.00 AM   7/18/2011 1.09 AM
3          7/18/2011 1.20 AM   7/18/2011 1.40 AM
4          7/18/2011 1.50 AM   7/18/2011 1.59 AM
5          7/19/2011 1.00 AM   7/19/2011 1.05 AM
6          7/19/2011 1.00 AM   7/19/2011 1.09 AM
7          7/19/2011 1.20 AM   7/19/2011 1.40 AM
8          7/19/2011 1.10 AM   7/18/2011 1.59 AM
I would like to calculate the time difference between EnrolledEndTime and EnrolledStartTime, group it into 15-minute buckets, and count the students that fall into each bucket.
Expected result:
Count(StudentID)  Date       0-15Mins  16-30Mins  31-45Mins  46-60Mins
4                 7/18/2011  3         1          0          0
4                 7/19/2011  2         1          0          1
Can I use the PIVOT function to achieve the required result? Any pointers would be helpful.
Create a table variable/temp table that includes all the columns from the original table, plus one column that marks the row as 0, 16, 31 or 46. Then SELECT * FROM temp table name PIVOT (Count(StudentID) FOR new column name in (0, 16, 31, 46). That should put you pretty close.
It's possible (just see the basic pivot instructions here: http://msdn.microsoft.com/en-us/library/ms177410.aspx), but one problem you'll have using pivot is that you need to know ahead of time which columns you want to pivot into. E.g., you mention 0-15, 16-30, etc., but actually you have no idea how long some students might take: some might take 24 hours, or your full session timeout, or what have you. So to alleviate this problem, I'd suggest having a final column as a catch-all, labeled something like '>60'. Other than that, just do a select on this table, selecting the student ID, the date, and a CASE expression, and you'll have everything you need to work the pivot on. CASE WHEN date2 - date1 < 15 THEN '0-15' WHEN date2-date1 < 30 THEN '16-30'...ELSE '>60' END.
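Put together, the CASE bucket plus PIVOT idea might look like the sketch below. It is an illustration under assumptions, not tested against the asker's schema: @Enroll is a hypothetical table variable holding a few of the sample rows, the bucket boundaries are inclusive upper bounds (15, 30, 45, 60 minutes), and DATEDIFF(MINUTE, ...) stands in for the "date2 - date1" shorthand.

```sql
-- Hypothetical table variable with a few rows shaped like the question's data
DECLARE @Enroll TABLE (StudentID INT, EnrolledStartTime DATETIME, EnrolledEndTime DATETIME);
INSERT @Enroll VALUES
  (1,'20110718 01:00','20110718 01:05'),
  (2,'20110718 01:00','20110718 01:09'),
  (3,'20110718 01:20','20110718 01:40'),
  (5,'20110719 01:00','20110719 01:05'),
  (7,'20110719 01:20','20110719 01:40');

SELECT p.EnrollDate, p.[0-15], p.[16-30], p.[31-45], p.[46-60], p.[>60]
FROM
(
  -- Source rows for the pivot: one row per student with its date and bucket label
  SELECT e.StudentID,
         EnrollDate = CAST(e.EnrolledStartTime AS DATE),
         Bucket = CASE
                    WHEN m.Mins <= 15 THEN '0-15'
                    WHEN m.Mins <= 30 THEN '16-30'
                    WHEN m.Mins <= 45 THEN '31-45'
                    WHEN m.Mins <= 60 THEN '46-60'
                    ELSE '>60'          -- catch-all bucket
                  END
  FROM @Enroll AS e
  CROSS APPLY (VALUES(DATEDIFF(MINUTE, e.EnrolledStartTime, e.EnrolledEndTime))) AS m(Mins)
) AS src
PIVOT (COUNT(StudentID) FOR Bucket IN ([0-15],[16-30],[31-45],[46-60],[>60])) AS p
ORDER BY p.EnrollDate;
```

The derived table exposes only StudentID, EnrollDate and Bucket, so PIVOT groups implicitly by EnrollDate. The column list inside IN has to be written out in advance, which is why the catch-all bucket matters.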
I have an old version of MS SQL Server that doesn't support pivot, so I wrote the SQL for getting the data but couldn't test the pivot part. The rest of the SQL will give you the exact data for the pivot table. If you accept null instead of 0, it can be written a lot more simply: you can skip the subselect "a" defined in "with a...".
declare @t table (EnrolledStartTime datetime,EnrolledEndTime datetime)
insert @t values('2011/7/18 01:00', '2011/7/18 01:05')
insert @t values('2011/7/18 01:00', '2011/7/18 01:09')
insert @t values('2011/7/18 01:20', '2011/7/18 01:40')
insert @t values('2011/7/18 01:50', '2011/7/18 01:59')
insert @t values('2011/7/19 01:00', '2011/7/19 01:05')
insert @t values('2011/7/19 01:00', '2011/7/19 01:09')
insert @t values('2011/7/19 01:20', '2011/7/19 01:40')
insert @t values('2011/7/19 01:10', '2011/7/19 01:59')
;with a
as
(select * from
(select distinct dateadd(day, cast(EnrolledStartTime as int), 0) date from @t) dates
cross join
(select '0-15Mins' t, 0 group1 union select '16-30Mins', 1 union select '31-45Mins', 2 union select '46-60Mins', 3) i)
, b as
(select (datediff(minute, EnrolledStartTime, EnrolledEndTime )-1)/15 group1, dateadd(day, cast(EnrolledStartTime as int), 0) date
from @t)
select count(b.date) count, a.date, a.t, a.group1 from a
left join b
on a.group1 = b.group1
and a.date = b.date
group by a.date, a.t, a.group1
-- PIVOT(max(date)
-- FOR group1
-- in(['0-15Mins'], ['16-30Mins'], ['31-45Mins'], ['46-60Mins'])AS p