I have two datasets:
Panel data with years from 2010 to 2020
Loan ranges for each firm (one or many); each range also indicates the loan duration
The first can look like this (each company has an observation for every year):
id year
1 2010
1 ...
1 2020
2 2010
2 ...
2 2020
The second can look like this (with all sorts of variety: a company can have a loan in every year, or gaps at the beginning and end):
id start end
1 2010 2011
1 2011 2014
1 2017 2018
1 2012 2020
2 2014 2018
3 2011 2012
3 2015 2018
3 2018 2020
4 2011 2012
4 2015 2018
4 2010 2018
The idea is to merge the two so that each company gets a 0/1 flag for every year: 1 if any of its loan ranges covers that year, and 0 if it had no loan that year.
An example company based on the data above would look like this:
id year flag
3 2010 0
3 2011 1
3 2012 1
3 2013 0
3 2014 0
3 2015 1
3 2016 1
3 2017 1
3 2018 1
3 2019 1
3 2020 1
Hope that makes sense. I tried inrange(), but there are too many different scenarios and my code gets messy; I thought there must be a simpler and cleaner way to do it.
If you rework the second dataset, you can get something fit to merge with your main dataset.
clear
input id start end
1 2010 2011
1 2011 2014
1 2017 2018
1 2012 2020
2 2014 2018
3 2011 2012
3 2015 2018
3 2018 2020
4 2011 2012
4 2015 2018
end
gen long ID = _n                       // unique identifier for each loan spell
gen toexpand = end - start + 1         // number of years the spell covers
expand toexpand                        // one observation per covered year
bysort ID : gen year = start + _n - 1  // fill in the calendar years
drop start end ID toexpand
duplicates drop id year, force         // overlapping spells create repeated id-year rows
sort id year
list, sepby(id)
+-----------+
| id year |
|-----------|
1. | 1 2010 |
2. | 1 2011 |
3. | 1 2012 |
4. | 1 2013 |
5. | 1 2014 |
6. | 1 2015 |
7. | 1 2016 |
8. | 1 2017 |
9. | 1 2018 |
10. | 1 2019 |
11. | 1 2020 |
|-----------|
12. | 2 2014 |
13. | 2 2015 |
14. | 2 2016 |
15. | 2 2017 |
16. | 2 2018 |
|-----------|
17. | 3 2011 |
18. | 3 2012 |
19. | 3 2015 |
20. | 3 2016 |
21. | 3 2017 |
22. | 3 2018 |
23. | 3 2019 |
24. | 3 2020 |
|-----------|
25. | 4 2011 |
26. | 4 2012 |
27. | 4 2015 |
28. | 4 2016 |
29. | 4 2017 |
30. | 4 2018 |
+-----------+
After a merge 1:1 id year with your main panel dataset, the flag is simply
gen flag = _merge == 3
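A minimal end-to-end sketch of that last step (the file names loanyears.dta and panel.dta are placeholders, not from the original post):
save loanyears, replace            // the expanded loan-year observations built above
use panel, clear                   // the main panel dataset with all id-year rows
merge 1:1 id year using loanyears
gen flag = _merge == 3             // 1 if any loan range covers the year, 0 otherwise
drop _merge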
So I have a query that produces the following result:
actname | year | tickets
---------------+----------+---------
Join Division | 2016 | 2
Join Division | 2018 | 2
Join Division | 2020 | 3
Join Division | Total | 7 <<<
QLS | 2018 | 2
QLS | 2019 | 1
QLS | Total | 3 <<<
Scalar Swift | 2017 | 3
Scalar Swift | 2018 | 1
Scalar Swift | 2019 | 1
Scalar Swift | Total | 5 <<<
The Selecter | 2017 | 4
The Selecter | 2018 | 4
The Selecter | Total | 8 <<<
The Where | 2016 | 1
The Where | 2017 | 3
The Where | 2018 | 5
The Where | 2020 | 4
The Where | Total | 13 <<<
ViewBee 40 | 2017 | 3
ViewBee 40 | 2018 | 1
ViewBee 40 | Total | 4 <<<
The problem I have is that I want to re-order the results such that the group with the lowest Total occurs first, such that the results would look like this:
actname | year | tickets
---------------+----------+---------
QLS | 2018 | 2
QLS | 2019 | 1
QLS | Total | 3 <<<
ViewBee 40 | 2017 | 3
ViewBee 40 | 2018 | 1
ViewBee 40 | Total | 4 <<<
Scalar Swift | 2017 | 3
Scalar Swift | 2018 | 1
Scalar Swift | 2019 | 1
Scalar Swift | Total | 5 <<<
Join Division | 2016 | 2
Join Division | 2018 | 2
Join Division | 2020 | 3
Join Division | Total | 7 <<<
The Selecter | 2017 | 4
The Selecter | 2018 | 4
The Selecter | Total | 8 <<<
The Where | 2016 | 1
The Where | 2017 | 3
The Where | 2018 | 5
The Where | 2020 | 4
The Where | Total | 13 <<<
I'm obtaining the results by using the following GROUP BY clause:
GROUP BY actname, ROLLUP(year)
which combines all the ticket amounts of the same actname and year together.
I can provide the full query if necessary!
Thanks
Using a window function (sum() in this case) you can attach a value to each group (partitioned by the actname column), so that every row of an actname group carries the same value as that group's year = 'Total' row.
Then simply sort by that new column, something like this:
with t(actname, year, tickets) as (
VALUES
('Join Division','2016',2),
('Join Division','2018',2),
('Join Division','2020',3),
('Join Division','Total',7),
('QLS','2018',2),
('QLS','2019',1),
('QLS','Total',3 ),
('Scalar Swift','2017',3),
('Scalar Swift','2018',1),
('Scalar Swift','2019',1),
('Scalar Swift','Total',5 ),
('The Selecter','2017',4),
('The Selecter','2018',4),
('The Selecter','Total',8 ),
('The Where','2016',1),
('The Where','2017',3),
('The Where','2018',5),
('The Where','2020',4),
('The Where','Total',13 ),
('ViewBee 40','2017',3),
('ViewBee 40','2018',1),
('ViewBee 40','Total',4 )
)
SELECT *
FROM (
    SELECT *,
           sum(CASE WHEN year = 'Total' THEN tickets END)
               OVER (PARTITION BY actname) AS sm
    FROM t
) tt
ORDER BY sm, year
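If you would rather sort the original ROLLUP query directly instead of wrapping an existing result set, something along these lines should also work. This is a hypothetical sketch (the full query was not posted), assuming a base table bookings(actname, year, tickets); MAX of the per-group sums within each actname partition equals that act's Total, since ticket counts are non-negative:
SELECT actname,
       COALESCE(year::text, 'Total') AS year,
       SUM(tickets) AS tickets
FROM bookings
GROUP BY actname, ROLLUP(year)
ORDER BY MAX(SUM(tickets)) OVER (PARTITION BY actname),  -- each act's grand total
         actname,
         year;                                           -- 'Total' sorts after the year strings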
I just noticed that my code below is not actually a 7-day moving average; it is a 7-row moving average. The dates in my table span several months, and I'm trying to iron this out: my data flow is inconsistent, so I can't expect the last 7 rows of the window frame to actually represent a 7-day average. Thanks.
select date, sales,
avg(sales) over(order by date rows between 6 preceding and current row)
from sales_info
order by date
You can get a bit closer to a true 7-day moving average by using RANGE instead of ROWS in your frame specification.
Read more about window function frames in the PostgreSQL documentation.
I believe this should work for you:
select date, sales,
avg(sales) over(order by date range between '6 days' preceding and current row)
from sales_info
order by date;
Here's a demonstration with made up data:
SELECT i,
       t,
       avg(i) OVER (ORDER BY t RANGE BETWEEN '6 days' PRECEDING AND CURRENT ROW)
FROM (
    SELECT i, t
    FROM generate_series('2021-01-01'::timestamp, '2021-02-01'::timestamp, '1 day')
         WITH ORDINALITY AS g(t, i)
) sub;
i | t | avg
----+---------------------+------------------------
1 | 2021-01-01 00:00:00 | 1.00000000000000000000
2 | 2021-01-02 00:00:00 | 1.5000000000000000
3 | 2021-01-03 00:00:00 | 2.0000000000000000
4 | 2021-01-04 00:00:00 | 2.5000000000000000
5 | 2021-01-05 00:00:00 | 3.0000000000000000
6 | 2021-01-06 00:00:00 | 3.5000000000000000
7 | 2021-01-07 00:00:00 | 4.0000000000000000
8 | 2021-01-08 00:00:00 | 5.0000000000000000
9 | 2021-01-09 00:00:00 | 6.0000000000000000
10 | 2021-01-10 00:00:00 | 7.0000000000000000
11 | 2021-01-11 00:00:00 | 8.0000000000000000
12 | 2021-01-12 00:00:00 | 9.0000000000000000
13 | 2021-01-13 00:00:00 | 10.0000000000000000
14 | 2021-01-14 00:00:00 | 11.0000000000000000
15 | 2021-01-15 00:00:00 | 12.0000000000000000
16 | 2021-01-16 00:00:00 | 13.0000000000000000
17 | 2021-01-17 00:00:00 | 14.0000000000000000
18 | 2021-01-18 00:00:00 | 15.0000000000000000
19 | 2021-01-19 00:00:00 | 16.0000000000000000
20 | 2021-01-20 00:00:00 | 17.0000000000000000
21 | 2021-01-21 00:00:00 | 18.0000000000000000
22 | 2021-01-22 00:00:00 | 19.0000000000000000
23 | 2021-01-23 00:00:00 | 20.0000000000000000
24 | 2021-01-24 00:00:00 | 21.0000000000000000
25 | 2021-01-25 00:00:00 | 22.0000000000000000
26 | 2021-01-26 00:00:00 | 23.0000000000000000
27 | 2021-01-27 00:00:00 | 24.0000000000000000
28 | 2021-01-28 00:00:00 | 25.0000000000000000
29 | 2021-01-29 00:00:00 | 26.0000000000000000
30 | 2021-01-30 00:00:00 | 27.0000000000000000
31 | 2021-01-31 00:00:00 | 28.0000000000000000
32 | 2021-02-01 00:00:00 | 29.0000000000000000
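Note that with RANGE the average is still taken over only the days that actually appear in the window; if a missing day should instead count as zero sales, you can divide the 7-day windowed sum by a fixed 7. A sketch using the same sales_info table (assuming at most one row per date):
select date, sales,
       sum(sales) over(order by date
                       range between '6 days' preceding and current row) / 7.0
           as strict_7day_avg
from sales_info
order by date;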
I would like to find out the maximum number of rate options in a given month for each of my users. Here is what my rates table looks like:
Member | Month | Rate
Joe | Jan | 1
Joe | Jan | 2
Joe | Jan | 3
Joe | Feb | 1
Joe | Feb | 2
Joe | Feb | 2
Joe | Mar | 1
Joe | Mar | 2
Joe | Mar | 2
Max | Jan | 1
Max | Jan | 1
Max | Jan | 1
Max | Feb | 2
Max | Feb | 2
Max | Feb | 2
Max | Mar | 3
Max | Mar | 3
Max | Mar | 3
Ben | Jan | 1
Ben | Jan | 2
Ben | Jan | 2
Ben | Feb | 1
Ben | Feb | 1
Ben | Feb | 1
Ben | Mar | 1
Ben | Mar | 1
Ben | Mar | 1
Joe, in January, has rate options [1,2,3] available to him. In February and March, he only has two: [1,2]. For each user, I'd like to display the maximum number of distinct rates available in any one month (compared across all months). The outcome table should look like this:
Member | Max rates in one month
Joe | 3
Max | 1
Ben | 2
How would I write this query?
First, you need to group on member and month and count the distinct rates:
SELECT "Member", "Month", count(DISTINCT "Rate") AS n FROM rates GROUP BY 1,2;
Now that you have the count of distinct rates per member per month, you can use the result to extract the maximum value for every member:
SELECT "Member", max(n) FROM (
    SELECT "Member", "Month", count(DISTINCT "Rate") AS n FROM rates GROUP BY 1,2) X
GROUP BY "Member";
There are other methods using, for example, window functions, but the idea is always the same, and I think the code above makes it clear what's happening.
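For example, a sketch of that window-function variant, using the same rates table and quoted column names:
SELECT DISTINCT "Member",
       max(n) OVER (PARTITION BY "Member") AS max_rates
FROM (
    SELECT "Member", "Month", count(DISTINCT "Rate") AS n
    FROM rates
    GROUP BY 1, 2
) x;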
You can do this with some aggregation:
SELECT Member, MAX(rate_count) as "Max rates in one month"
FROM
(SELECT Member, Month, COUNT(DISTINCT RATE) rate_count FROM yourtable GROUP BY Member, Month) dt
GROUP BY Member;
I have to append three datasets named A, B and C that contain data for various years (for example, 1990, 1991...2014).
The problem is that not all datasets contain all the survey years and therefore the unmatched years need to be dropped manually before appending.
I would like to know if there is any way to append three (or more) datasets that keeps only the years common to all of them.
Consider the following toy example:
clear
input year var
1995 0
1996 1
1997 2
1998 3
1999 4
2000 5
end
save data1, replace
clear
input year var
1995 6
1996 9
1998 7
1999 8
2000 9
end
save data2, replace
clear
input year var
1995 10
1996 11
1997 12
2000 13
end
save data3, replace
There is no option that will force append to do what you want, but you can do the following:
use data1, clear
append using data2 data3
duplicates tag year, generate(tag)    // tag = number of other appended observations with the same year
sort year
list
+------------------+
| year var tag |
|------------------|
1. | 1995 0 2 |
2. | 1995 6 2 |
3. | 1995 10 2 |
4. | 1996 9 2 |
5. | 1996 1 2 |
|------------------|
6. | 1996 11 2 |
7. | 1997 2 1 |
8. | 1997 12 1 |
9. | 1998 7 1 |
10. | 1998 3 1 |
|------------------|
11. | 1999 8 1 |
12. | 1999 4 1 |
13. | 2000 13 2 |
14. | 2000 5 2 |
15. | 2000 9 2 |
+------------------+
drop if tag == 1
list
+------------------+
| year var tag |
|------------------|
1. | 1995 0 2 |
2. | 1995 6 2 |
3. | 1995 10 2 |
4. | 1996 9 2 |
5. | 1996 1 2 |
|------------------|
6. | 1996 11 2 |
7. | 2000 13 2 |
8. | 2000 5 2 |
9. | 2000 9 2 |
+------------------+
You can also further generalize this approach by finding the maximum value of the variable tag and keeping all observations with that value:
summarize tag
keep if tag == `r(max)'
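Equivalently, if each year occurs at most once per dataset, you can make the condition explicit rather than relying on the observed maximum (a sketch; the 3 is the number of appended datasets):
bysort year : gen nfiles = _N      // how many of the appended datasets contain this year
keep if nfiles == 3                // keep only years present in all three
drop nfiles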
Say I have this table,
year | name | score
------+---------------+----------
2017 | BRAD | 5
2017 | BOB | 5
2016 | JON | 6
2016 | GUYTA | 2
2015 | PAC | 2
2015 | ZAC | 0
How would I go about averaging the scores by year and then getting the difference between years?
year | increase
------+-----------
2017 | 1
2016 | 3
You should use a window function, lead() in this case:
select year, avg, (avg - lead(avg) over w)::int as increase
from (
select year, avg(score)::int
from my_table
group by 1
) s
window w as (order by year desc);
year | avg | increase
------+-----+----------
2017 | 5 | 1
2016 | 4 | 3
2015 | 1 |
(3 rows)