Distinct Count after Sum - tableau-api

So I am looking to do a count after aggregation. Basically, I want to total up the inventory count with a sum and then count how many items each employee has with a non-zero inventory total.
So for this data, Jack and Jimmy would each have a count of 1, Sam would have a count of 2, and Steve would have a count of 0. I could easily do this in SQL on the back end, but I also want users to be able to use a date parameter. So if they shifted the date to only 1/1/17, Sam would have a count of 1 and everyone else would have 0. Any help would be much appreciated!
Data
Emp Item Inventory Date
Sam Crackers 1 1/1/2017
Jack Crackers 1 1/1/2017
Jack Crackers -1 2/1/2017
Jimmy Crackers -2 1/1/2017
Sam Apples 1 1/1/2017
Steve Apples -1 1/1/2017
Sam Cheese 1 1/1/2017
With Date >= '1/1/17':
Emp NonZeroCount
Sam 2
Jack 1
Jimmy 1
Steve 0
With Date = '1/1/17':
Emp NonZeroCount
Sam 1
Jack 0
Jimmy 0
Steve 0
SQL I envision it replacing
Create Table #Test(
Empl varchar(50),
Item Varchar (50),
Inventory int,
Date Date
)
Declare @DateParam Date
Set @DateParam = '1/1/17'
Insert into #Test (Empl,Item,Inventory,Date)
Values
('Sam','Crackers',1,'1/1/2017'),
('Jack','Crackers',1,'1/1/2017'),
('Jack','Crackers',-1,'2/1/2017'),
('Jimmy','Crackers',-2,'1/1/2017'),
('Sam','Apples',1,'1/1/2017'),
('Steve','Apples',-1,'1/1/2017'),
('Sam','Cheese',1,'1/1/2017');
Select
Item,Sum(Inventory) as Total
into #badItems
from #Test
Where Date >= @DateParam
group by Item
having Sum(Inventory) <> 0
Select
T.Empl, Count(Distinct BI.Item) as NonZeroCount
From #Test T
Left Join #badItems BI on BI.Item = T.Item
group by T.Empl

This is a good case for creating a set in Tableau.
Select the Item field in the data pane on the left, and right-click to create a set based on that field. Name it Bad Items, and define it using the following formula on the Condition tab, which assumes you've defined a parameter named [DateParam] of type Date.
sum(if [Date] >= [DateParam] then [Inventory] end) <> 0
You can then use the set on the filter shelf or rows shelf, in calculations, or combine it with other sets as desired.
P.S. I used an alias to display the text "Bad Items" instead of "In" in the table, and set a manual default sort order for the Emp field (in case you are trying to reproduce this exactly).
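An alternative, if you'd rather avoid sets: a level-of-detail calculation can express the same per-item test as an ordinary boolean field. This is only a sketch assuming the same [DateParam] parameter, not something from the original answer:
{ FIXED [Item] : SUM(IF [Date] >= [DateParam] THEN [Inventory] END) } <> 0
Because the date test lives inside the expression rather than on a filter shelf, the FIXED calculation respects it; dropping this field on filters and keeping True behaves much like the Bad Items set.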

Related

Replace content in 'order' column with sequential numbers

I have an 'order' column in a table in a Postgres database that has a lot of missing numbers in the sequence. I am having a problem figuring out how to replace the numbers currently in the column with new ones that are incremental (see examples).
What I have:
id order name
---------------
1 50 Anna
2 13 John
3 2 Bruce
4 5 David
What I want:
id order name
---------------
1 4 Anna
2 3 John
3 1 Bruce
4 2 David
The row containing the lowest order number in the old version of the column should get the new order number '1', the next after that should get '2' etc.
You can use the window function row_number() to calculate the new numbers. The result of that can be used in an update statement:
update the_table
set "order" = t.rn
from (
select id, row_number() over (order by "order") as rn
from the_table
) t
where t.id = the_table.id;
This assumes that id is the primary key of that table.
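If you want to try it out, here is a minimal setup matching the sample above (table and column names are taken from the question; "order" must stay double-quoted because it is a reserved word):
create table the_table (id int primary key, "order" int, name text);
insert into the_table (id, "order", name) values
(1, 50, 'Anna'), (2, 13, 'John'), (3, 2, 'Bruce'), (4, 5, 'David');
-- after running the update, this should show order = 4, 3, 1, 2 for ids 1-4
select * from the_table order by id;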

Group events by sequence, defining the minimum period between sequences - t-SQL

I have a table of events, called tbl_events that looks something like this:
PersonID Date
1 30/03/2015
1 22/04/2015
1 30/06/2015
2 18/07/2016
2 09/12/2016
2 28/04/2017
3 01/10/2014
3 28/11/2016
3 28/11/2016
3 16/01/2017
4 13/04/2017
4 09/05/2017
I want to be able to group these events by the start date of each 'sequence', with a sequence being defined as a run of events from the first identified to the last identified for each PersonID. The last event in a sequence is the one after which there are no further events for that PersonID for a year.
The result of this I would expect to look like is below:
PersonID FirstDate Sequence Events
1 30/03/2015 1 3
2 18/07/2016 1 3
3 01/10/2014 1 1
3 28/11/2016 2 3
4 13/04/2017 1 2
I am able to identify the sequences in Excel and pivot the data, but I need to be able to do this in SQL.
Here is the formula I have used in Excel to generate the sequence number (I am populating cell C3, with column A being PersonID and B being Date):
=+IF(A2<>A3,1,IF((B3-B2)<365,C2,C2+1))
I have joined the table back on itself using ROW_NUMBER to get the difference between the Date and the previous event date for that ID, but I'm not really sure where to go from there.
Any help is much appreciated.
My solution is based on the sample data you've provided along with your Excel formula.
-- easily consumable sample data
DECLARE @tbl_events TABLE (PersonId int, [date] date)
INSERT @tbl_events VALUES
(1,'20150330'),(1,'20150422'),(1,'20150630'),(2,'20160718'),(2,'20161209'),(2,'20170428'),
(3,'20141001'),(3,'20161128'),(3,'20161128'),(3,'20170116'),(4,'20170413'),(4,'20170509');
-- Solution
WITH groupings AS
(
SELECT
PersonId,
FirstDate = MIN([date]) OVER (PARTITION BY personId ORDER BY [date]),
NextDate = LAG([date],1,[date]) OVER (PARTITION BY personId ORDER BY [date]),
[date],
grouper =
DATEDIFF(DAY, MIN([date]) OVER (PARTITION BY personId ORDER BY [date]), [date]) / 365
FROM @tbl_events
),
Prep AS
(
SELECT
PersonId,
firstDate = IIF(grouper = 0, FirstDate, IIF(FirstDate = NextDate, [date],NextDate))
FROM groupings
)
SELECT
PersonId,
FirstDate,
[Sequence] = ROW_NUMBER() OVER (PARTITION BY personId ORDER BY FirstDate),
[Events] = COUNT(*)
FROM prep
GROUP BY personId, FirstDate;
Results
PersonId FirstDate Sequence Events
----------- ---------- -------------------- -----------
1 2015-03-30 1 3
2 2016-07-18 1 3
3 2014-10-01 1 1
3 2016-11-28 2 3
4 2017-04-13 1 2
First, note that not all years have 365 days; nonetheless, I'm using 365 to emulate your Excel logic, and this would need to be updated to account for leap years. Next, like your Excel formula, this will only be correct when there are at most two sequences;
it would not work when, say, a PersonId has a date of Jan 1 2015, then Jan 10 2016, then Feb 1 2017. Let us know if you need logic to accommodate the aforementioned scenarios.
Lastly, this solution uses LAG, which requires SQL Server 2012+; if you're working with an earlier version of SQL Server, the query will have to be updated accordingly.
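For pre-2012 versions, a common substitute for LAG is a ROW_NUMBER self-join. This is only a sketch of that one piece, reusing the @tbl_events sample above; the rest of the query would stay the same:
WITH numbered AS
(
    SELECT PersonId, [date],
           rn = ROW_NUMBER() OVER (PARTITION BY PersonId ORDER BY [date])
    FROM @tbl_events
)
SELECT cur.PersonId,
       cur.[date],
       -- COALESCE stands in for LAG's third argument: default to the current row's date
       NextDate = COALESCE(prev.[date], cur.[date])
FROM numbered cur
LEFT JOIN numbered prev
    ON prev.PersonId = cur.PersonId
    AND prev.rn = cur.rn - 1;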

How can 'brand new, never before seen' IDs be counted per month in redshift?

A fair amount of material is available detailing methods utilising dense_rank() and the like to count distinct somethings per month; however, I've been unable to find anything that counts distinct ids per month while also removing/discounting any ids that have been seen in prior months.
The data can be imagined like so:
id (int8 type) | observed time (timestamp utc)
------------------
1 | 2017-01-01
2 | 2017-01-02
1 | 2017-01-02
1 | 2017-02-02
2 | 2017-02-03
3 | 2017-02-04
1 | 2017-03-01
3 | 2017-03-01
4 | 2017-03-01
5 | 2017-03-02
The process of the count can be seen as:
1: in 2017-01 we saw devices 1 and 2 so the count is 2
2: in 2017-02 we saw devices 1, 2 and 3. We know already about devices 1 and 2, but not 3, so the count is 1
3: in 2017-03 we saw devices 1, 3, 4 and 5. We already know about 1 and 3, but not 4 or 5, so the count is 2.
with the desired output being something like:
observed time | count of new id
--------------------------
2017-01 | 2
2017-02 | 1
2017-03 | 2
Explicitly, I am looking to have a new table, with an aggregated month per row, with a count of how many new ids occur within that month that have not been seen at all before.
The IRL case allows devices to be seen more than once in a month, but this shouldn't impact the count. It also stores the id as an integer (both positive and negative), and time periods will be to the second in true timestamps. The size of the data set is also significant.
My initial attempt is along the lines of:
WITH records_months AS (
SELECT *,
date_trunc('month', observed_time) AS month_group
FROM my_table
WHERE observed_time > '2017-01-01'),
id_months AS (
SELECT DISTINCT
month_group,
id
FROM records_months
GROUP BY month_group, id)
SELECT *
FROM id_months
However, I'm stuck on the next part, i.e. counting the number of new ids that were not seen in prior months. I believe the solution might be a window function, but I'm having trouble working out which one, or how.
First thing I thought of. The idea is to
(innermost query) calculate the earliest month that each id was seen,
(next level up) join that back to the main my_table dataset, and then
(outer query) count distinct ids by month after nulling out the already-seen ids.
I tested it out and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (vs. a window function). Hopefully this is performant enough for your Redshift!
select observed_month,
-- Null out the id if the observed_month that we're grouping by
-- is NOT the earliest month that the id was seen.
-- Then count distinct id
count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
select t.id,
date_trunc('month', t.observed_time) as observed_month,
earliest.earliest_month
from my_table t
join (
-- What's the earliest month an id was seen?
select id,
date_trunc('month', min(observed_time)) as earliest_month
from my_table
group by 1
) earliest
on t.id = earliest.id
) t -- the derived table needs an alias
group by 1
order by 1;
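Since the question mentions window functions: here is a hedged alternative sketch that computes each id's earliest month with MIN(...) OVER (PARTITION BY id) instead of a join, against the same assumed my_table schema:
select observed_month,
       -- an id only counts in the month it first appeared
       count(distinct case when observed_month = earliest_month then id end) as num_new_ids
from (
    select id,
           date_trunc('month', observed_time) as observed_month,
           min(date_trunc('month', observed_time)) over (partition by id) as earliest_month
    from my_table
) t
group by 1
order by 1;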

How to calculate average number and give subquery label

I have two tables, "book" and "authorCollection". Because a book may have multiple authors, I hope to get the average number of authors for the books in table "book" that were published in or after year 2000.
For example:
Table Book:
key year
1 2000
2 2001
3 2002
4 1999
Table authorCollection:
key author
1 Tom
1 John
1 Alex
1 Mary
2 Alex
3 Tony
4 Mary
The result should be (4 + 1 + 1) / 3 = 2 (key 4 was published before year 2000).
I wrote the following query statement, but it is not right. I need to get the number of results in the subquery but cannot give it the label "b". How can I solve this problem and get the average number of authors? I am still confused about what "COUNT(*) as count" means. Thanks.
SELECT COUNT(*) as count, b.COUNT(*) AS total
FROM A
WHERE key IN (SELECT key
FROM Book
WHERE year >= 2000
) b
GROUP BY key;
First, count the number of authors per key in a subquery. Next, aggregate the needed values:
select avg(coalesce(ct, 0))
from book b
left join (
select key, count(*) ct
from authorcollection
group by 1
) a
using (key)
where year >= 2000;
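For the sample data, the per-key author counts are 4, 1, and 1 for the three books published in or after 2000, so avg returns (4 + 1 + 1) / 3 = 2 as expected; the coalesce covers books that have no authors at all.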
A sample that also handles the 'divide by zero' error:
select case when count(distinct book.key) = 0
            then null
            else count(authorCollection.key) / count(distinct book.key)
       end as avg_after_2000
from book
left join authorCollection on book.key = authorCollection.key
where book.year >= 2000;
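One caveat worth noting: in many databases dividing one count by another performs integer division (the sample happens to divide evenly to 2). A hedged variant with an explicit cast, assuming Postgres-style syntax:
select case when count(distinct book.key) = 0
            then null
            else count(authorCollection.key)::numeric / count(distinct book.key)
       end as avg_after_2000
from book
left join authorCollection on book.key = authorCollection.key
where book.year >= 2000;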

SELECT record based upon dates

Assuming data such as the following:
ID EffDate Rate
1 12/12/2011 100
1 01/01/2012 110
1 02/01/2012 120
2 01/01/2012 40
2 02/01/2012 50
3 01/01/2012 25
3 03/01/2012 30
3 05/01/2012 35
How would I find the rate for ID 2 as of 1/15/2012?
Or, the rate for ID 1 for 1/15/2012?
In other words, how do I do a query that finds the correct rate when the date falls between the EffDates of two records? (The rate should be the one for the date prior to the selected date.)
Thanks,
John
How about this:
SELECT Rate
FROM Table1
WHERE ID = 1 AND EffDate = (
SELECT MAX(EffDate)
FROM Table1
WHERE ID = 1 AND EffDate <= '2012-01-15');
Here's an SQL Fiddle to play with. I assume here that the ID/EffDate pair is unique across the whole table (at least the opposite doesn't make sense).
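If your database has window functions (SQL Server 2005+, Postgres, etc.), a ROW_NUMBER sketch of the same idea, using the assumed Table1 name from above:
SELECT Rate
FROM (
    SELECT Rate,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY EffDate DESC) AS rn
    FROM Table1
    WHERE ID = 2 AND EffDate <= '2012-01-15'
) t
WHERE rn = 1;
-- for the sample data this returns 40, ID 2's rate effective 01/01/2012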
SELECT TOP 1 Rate FROM the_table
WHERE ID = whatever AND EffDate <= 'whatever'
ORDER BY EffDate DESC
if I read you right.
(Edited to suit my idea of MS SQL, which I have no idea about.)