Creating Dummy Variable for Two Events in Dataset and Restricting Regression by Time Period in Stata

I am testing the effect of two events in a dataset using a time dummy variable (post). For instance, my first event is in 1999, so I want the post period to run from after 1999 to 2003. Similarly, my second event is in 2010, so I want the post period to run from after 2010 to 2014. Accordingly, I want to set the dummy variable (post) equal to 1 for observations in the post period and 0 for the pre period.
One approach I understand is to make a separate dataset for each event and run the analysis on each. However, is there any way I could accomplish the same thing within this one dataset?
Additionally, if I want to run a linear regression, is there any command I could look into that restricts the regression to each event period? That is, for the first event run the regression up to 2003, and for the second event up to 2014.
Any suggestions would be great. Thank you in advance.
* Example generated by -dataex-. For more info, type help dataex
clear
input str9 Ticker int Year str23 Industry double INBD float(ROA Size MTB LEV LOSS)
"TH:2S" 1995 "Industrials" .36363636363636365 . . . . 0
"TH:2S" 1996 "Industrials" .2857142857142857 . . . . 0
"TH:2S" 1997 "Industrials" .3333333333333333 . . . . 0
"TH:2S" 1998 "Industrials" .375 . . . . 0
"TH:2S" 1999 "Industrials" .625 . . . . 0
"TH:2S" 2000 "Industrials" .26666666666666666 . . . . 0
"TH:2S" 2001 "Industrials" .375 . . . . 0
"TH:2S" 2002 "Industrials" .21428571428571427 . . . . 0
"TH:2S" 2003 "Industrials" .25 . . . . 0
"TH:2S" 2004 "Industrials" .3333333333333333 . . . . 0
"TH:2S" 2005 "Industrials" .3333333333333333 . . . . 0
"TH:2S" 2006 "Industrials" 0 .05430601 13.347824 . .4487669 0
"TH:2S" 2007 "Industrials" .23076923076923078 .0748898 13.513892 . .405519 0
"TH:2S" 2008 "Industrials" .42857142857142855 .20173776 13.234512 . .3000143 0
"TH:2S" 2009 "Industrials" 0 .1005717 13.72495 1.1260449 .3933205 0
"TH:2S" 2010 "Industrials" .2727272727272727 .06970939 13.839908 1.0841182 .43828955 0
"TH:2S" 2011 "Industrials" .4 .06131507 13.873565 .9524283 .4303303 0
"TH:2S" 2012 "Industrials" 0 .04201316 14.04106 .9658951 .5001149 0
"TH:2S" 2013 "Industrials" 0 .05576677 14.014296 1.0163808 .4338268 0
"TH:2S" 2014 "Industrials" 0 .034753945 14.002456 1.6368295 .4192463 0
"TH:A" 1995 "Property & Construction" .45454545454545453 . . . . 0
"TH:A" 1996 "Property & Construction" .5 . . . . 0
"TH:A" 1997 "Property & Construction" 0 . . . . 0
"TH:A" 1998 "Property & Construction" .25 . . . . 0
"TH:A" 1999 "Property & Construction" .13333333333333333 . . . . 0
"TH:A" 2000 "Property & Construction" .4 . . . . 0
"TH:A" 2001 "Property & Construction" .5 . . . . 0
"TH:A" 2002 "Property & Construction" .6 . . . . 0
"TH:A" 2003 "Property & Construction" .3333333333333333 .025012556 14.53853 . .6812668 0
"TH:A" 2004 "Property & Construction" 0 .04457069 15.205904 1.6980723 .6221218 0
"TH:A" 2005 "Property & Construction" 0 .0020386886 15.268938 2.07966 .6608869 0
"TH:A" 2006 "Property & Construction" .3333333333333333 .001907199 15.27291 1.964063 .6603242 0
"TH:A" 2007 "Property & Construction" 0 -.015466996 15.44163 1.918302 .7279775 1
"TH:A" 2008 "Property & Construction" .5 .005696189 15.559473 1.776193 .7537934 0
"TH:A" 2009 "Property & Construction" .46153846153846156 .09692752 15.491538 1.440211 .6395585 0
"TH:A" 2010 "Property & Construction" .42857142857142855 .07763009 15.522564 1.1848537 .5729401 0
"TH:A" 2011 "Property & Construction" .5 .013262192 15.561353 1.1070081 .57702786 0
"TH:A" 2012 "Property & Construction" 0 .013741923 15.742103 1.2696354 .6365387 0
"TH:A" 2013 "Property & Construction" 0 .0015351704 16.00987 1.7140155 .6814458 0
"TH:A" 2014 "Property & Construction" 0 .0034536426 16.29612 1.858621 .7594991 0
"TH:AA" 1995 "Industrials" .3333333333333333 -.006353655 16.525984 4.846756 .6370975 1
"TH:AA" 1996 "Industrials" 0 -.0462963 16.797165 5.059855 .7687468 1
"TH:AA" 1997 "Industrials" .3333333333333333 -.17009 17.534548 2.606452 .8492652 1
"TH:AA" 1998 "Industrials" .4166666666666667 .11416897 17.54763 .7374102 .6734178 0
"TH:AA" 1999 "Industrials" .25 -.11794188 17.270618 2.397937 .8667688 1
"TH:AA" 2000 "Industrials" .5384615384615384 .019639796 17.28194 1.632345 .8497713 0
"TH:AA" 2001 "Industrials" .2222222222222222 .006626872 17.217733 1.587646 .8325074 0
"TH:AA" 2002 "Industrials" .35714285714285715 .04131401 17.18029 1.2312752 .7847582 0
"TH:AA" 2003 "Industrials" .375 .05050331 17.203983 2.1328473 .7387649 0
"TH:AA" 2004 "Industrials" .3333333333333333 .05501763 17.133413 1.411802 .6652893 0
"TH:AA" 2005 "Industrials" 0 .064267844 17.207062 1.1283801 .5871332 0
"TH:AA" 2006 "Industrials" .2 .065511316 17.220182 1.377417 .5249598 0
"TH:AA" 2007 "Industrials" .4444444444444444 .015064287 17.167637 1.3882604 .4864591 0
"TH:AA" 2008 "Industrials" .25 . . . . 0
"TH:AA" 2009 "Industrials" .4444444444444444 . . . . 0
"TH:AA" 2010 "Industrials" .3333333333333333 . . . . 0
"TH:AA" 2011 "Industrials" .14285714285714285 . . . . 0
"TH:AA" 2012 "Industrials" 0 . . . . 0
"TH:AA" 2013 "Industrials" 0 . . . . 0
"TH:AA" 2014 "Industrials" 0 . . . . 0
"TH:AAV" 1995 "Services" .23076923076923078 . . . . 0
"TH:AAV" 1996 "Services" .375 . . . . 0
"TH:AAV" 1997 "Services" .3333333333333333 . . . . 0
"TH:AAV" 1998 "Services" 0 . . . . 0
"TH:AAV" 1999 "Services" 0 . . . . 0
"TH:AAV" 2000 "Services" .4666666666666667 . . . . 0
"TH:AAV" 2001 "Services" 0 . . . . 0
"TH:AAV" 2002 "Services" .375 . . . . 0
"TH:AAV" 2003 "Services" 0 . . . . 0
"TH:AAV" 2004 "Services" .2 . . . . 0
"TH:AAV" 2005 "Services" .5 . . . . 0
"TH:AAV" 2006 "Services" 0 . . . . 0
"TH:AAV" 2007 "Services" .375 . . . . 0
"TH:AAV" 2008 "Services" .36363636363636365 . . . . 0
"TH:AAV" 2009 "Services" .3333333333333333 -.08246704 14.541677 . 3.183686 1
"TH:AAV" 2010 "Services" 0 .413217 15.397943 . 1.514293 0
"TH:AAV" 2011 "Services" 0 .53301877 15.147836 . 1.1440629 0
"TH:AAV" 2012 "Services" .3333333333333333 .473854 17.312746 .8970968 .20764913 0
"TH:AAV" 2013 "Services" .3333333333333333 .023205843 17.620733 .6654471 .4063622 0
"TH:AAV" 2014 "Services" .3333333333333333 .003700512 17.71752 .7719533 .4542441 0
"TH:ABICO" 1995 "Agro & Food Industry" .36363636363636365 . . . . 0
"TH:ABICO" 1996 "Agro & Food Industry" .3333333333333333 . . . . 0
"TH:ABICO" 1997 "Agro & Food Industry" .25 . . . . 0
"TH:ABICO" 1998 "Agro & Food Industry" .4444444444444444 . . . . 0
"TH:ABICO" 1999 "Agro & Food Industry" .23076923076923078 . . . . 0
"TH:ABICO" 2000 "Agro & Food Industry" .3333333333333333 -.4316369 14.122723 . 2.2422733 1
"TH:ABICO" 2001 "Agro & Food Industry" .2727272727272727 .2456064 14.19583 . 1.2081953 0
"TH:ABICO" 2002 "Agro & Food Industry" .36363636363636365 -.13187617 14.014474 . 1.4432036 1
"TH:ABICO" 2003 "Agro & Food Industry" 0 -.20122433 13.85465 . 1.7235887 1
"TH:ABICO" 2004 "Agro & Food Industry" .5 -1.2570164 13.76754 . 3.056561 1
"TH:ABICO" 2005 "Agro & Food Industry" .2 2.2142339 13.705672 . .929168 0
"TH:ABICO" 2006 "Agro & Food Industry" .3333333333333333 .04549691 13.680099 . .9444808 0
"TH:ABICO" 2007 "Agro & Food Industry" .45454545454545453 .14073876 13.111873 . 1.5902317 0
"TH:ABICO" 2008 "Agro & Food Industry" 0 -.07876977 12.94573 . 1.7910005 1
"TH:ABICO" 2009 "Agro & Food Industry" .3333333333333333 .11251536 13.14271 . 1.2879436 0
"TH:ABICO" 2010 "Agro & Food Industry" 0 .05981953 13.178632 . 1.1793016 0
"TH:ABICO" 2011 "Agro & Food Industry" 0 .1542739 13.224712 . .9512552 0
"TH:ABICO" 2012 "Agro & Food Industry" .3 .22630814 13.71115 . .5845832 0
"TH:ABICO" 2013 "Agro & Food Industry" .3 .1281038 13.741155 . .54510576 0
"TH:ABICO" 2014 "Agro & Food Industry" .3 .10894921 13.825327 . .4710184 0
end
format %ty Year

* Indicator variable for the two event windows:
* 0 = first event period (1999-2003), 1 = second event period (2010-2014)
gen post = .
replace post = 0 if inrange(Year, 1999, 2003)
replace post = 1 if inrange(Year, 2010, 2014)
* Separate regressions restricted to each event period
* (y, x1, x2 are placeholders for your own outcome and regressors)
regress y x1 x2 if post == 0
regress y x1 x2 if post == 1
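If you would rather not write one regress line per window, a loop over the two windows does the same thing. This is a minimal sketch: y, x1, and x2 remain placeholders for your actual outcome and regressors, which the question does not name.
* Hedged sketch: run one restricted regression per event window
forvalues p = 0/1 {
    display "=== event window `p' ==="
    regress y x1 x2 if post == `p'
}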

Related

How can I merge two data sets with ID variation in Stata

I have following two data sets.
The first one from children looks like this.
ID year Q1 Q2 Q3 Q4 ....
101 2014 1 2 2 2
101 2016 1 2 2 2
101 2017 1 2 2 2
101 2018 1 2 2 2
401 2014 1 2 2 2
401 2015 1 2 3 3
401 2016 1 2 2 2
401 2017 1 2 1 1
401 2018 1 2 2 2
402 2014 1 1 0 3
402 2015 1 1 2 2
402 2016 1 1 2 2
402 2017 1 1 3 3
402 2018 1 1 2 3
Here's the second one from their parents.
ID year Q101 Q102
1 2014 1 3
1 2015 1 3
1 2016 1 3
1 2017 1 2
1 2018 1 2
2 2014 2 .
2 2015 1 2
2 2016 . .
2 2017 1 3
2 2018 2 .
4 2014 1 3
4 2015 1 3
4 2016 1 3
4 2017 1 3
4 2018 1 3
So the parent ID can be matched to the child ID after deleting its last two digits. It seems that parent ID 4 has two children.
I tried
merge 1:m ID using the kids data
with the parent data as the master dataset, but it didn't work.
Thank you.
Getting good answers is made more likely by (a) attempting code and showing what you tried and (b) giving data in the form of code anybody using Stata can run. The code here follows from editing your post and is close to what you could get directly by using dataex as explained in the Stata tag wiki or indeed at help dataex in an up-to-date Stata or one in which you installed dataex from SSC.
clear
input ID year Q1 Q2 Q3 Q4
101 2014 1 2 2 2
101 2016 1 2 2 2
101 2017 1 2 2 2
101 2018 1 2 2 2
401 2014 1 2 2 2
401 2015 1 2 3 3
401 2016 1 2 2 2
401 2017 1 2 1 1
401 2018 1 2 2 2
402 2014 1 1 0 3
402 2015 1 1 2 2
402 2016 1 1 2 2
402 2017 1 1 3 3
402 2018 1 1 2 3
end
gen IDP = floor(ID/100)
save children
clear
input ID year Q101 Q102
1 2014 1 3
1 2015 1 3
1 2016 1 3
1 2017 1 2
1 2018 1 2
2 2014 2 .
2 2015 1 2
2 2016 . .
2 2017 1 3
2 2018 2 .
4 2014 1 3
4 2015 1 3
4 2016 1 3
4 2017 1 3
4 2018 1 3
end
rename ID IDP
merge 1:m IDP year using children
list
+----------------------------------------------------------------------+
| IDP year Q101 Q102 ID Q1 Q2 Q3 Q4 _merge |
|----------------------------------------------------------------------|
1. | 1 2014 1 3 101 1 2 2 2 matched (3) |
2. | 1 2015 1 3 . . . . . master only (1) |
3. | 1 2016 1 3 101 1 2 2 2 matched (3) |
4. | 1 2017 1 2 101 1 2 2 2 matched (3) |
5. | 1 2018 1 2 101 1 2 2 2 matched (3) |
|----------------------------------------------------------------------|
6. | 2 2014 2 . . . . . . master only (1) |
7. | 2 2015 1 2 . . . . . master only (1) |
8. | 2 2016 . . . . . . . master only (1) |
9. | 2 2017 1 3 . . . . . master only (1) |
10. | 2 2018 2 . . . . . . master only (1) |
|----------------------------------------------------------------------|
11. | 4 2014 1 3 401 1 2 2 2 matched (3) |
12. | 4 2015 1 3 401 1 2 3 3 matched (3) |
13. | 4 2016 1 3 402 1 1 2 2 matched (3) |
14. | 4 2017 1 3 401 1 2 1 1 matched (3) |
15. | 4 2018 1 3 402 1 1 2 3 matched (3) |
|----------------------------------------------------------------------|
16. | 4 2014 1 3 402 1 1 0 3 matched (3) |
17. | 4 2015 1 3 402 1 1 2 2 matched (3) |
18. | 4 2016 1 3 401 1 2 2 2 matched (3) |
19. | 4 2017 1 3 402 1 1 3 3 matched (3) |
20. | 4 2018 1 3 401 1 2 2 2 matched (3) |
+----------------------------------------------------------------------+
As far as the merge is concerned, the essentials are identifiers with the same name(s) in both datasets and the correct merge pattern. The parent identifier is only implied by the children dataset.
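For completeness, the same merge can be run in the other direction, with the children data as the master. A minimal sketch, assuming the parent data were saved as parents.dta just after the rename above (a filename the answer itself never creates):
* Hedged sketch: children as master, parents as using
use children, clear
merge m:1 IDP year using parents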

Create a conditional variable by group and time and day over the panel data

I am trying to analyse the connection between the probability of a call and the distance to vehicles.
The example dataset (here csv) looks like this:
id day time called d
1 2009-06-24 1700 0 1037.6
1 2009-06-24 1710 1 1191.9
1 2009-06-24 1720 0 165.5
The real dataset has 10 million rows. The ids represent locations that did or did not call, in time windows of (here) 10 minutes.
First, I would like to drop all rows for an id that never called at that time of day on any date during the whole period.
That leaves rows representing ids that called at the given time on at least one day during the analysis period.
I would then like to create a variable that equals 0 at the row of the call, -1 the day before (or hour, week, month, whatever, but here day) at the same time, +1 the day after, and so on. Later I would use that variable as input, together with called and distance, for analysis and comparison across different locations.
I have looked for other answered questions but did not find one that fits, so an answer or a pointer to one would be appreciated. I am using Stata 13, but a solution in Postgres 9.3 or R would be welcome as well.
I shall need to repeat this procedure multiple times for several datasets, so ideally I would like to automate as much as possible.
Update:
Here is example of the desired result:
id day time called d newvar newvar2
1 2009-06-24 1700 0 1037.6 null
1 2009-06-24 1710 1 1191.9 0 -2
1 2009-06-24 1720 0 165.5 -1
1 2009-06-25 1700 0 526.7 null
1 2009-06-25 1710 0 342.5 1 -1
1 2009-06-25 1720 1 416.1 0
1 2009-06-26 1700 0 428.3 null
1 2009-06-26 1710 1 240.7 2 0
1 2009-06-26 1720 0 228.7 1
1 2009-06-27 1700 0 282.5 null
1 2009-06-27 1710 0 182.1 3 1
1 2009-06-27 1720 0 195.5 2
2 2009-06-24 1700 0 198.0 -1
2 2009-06-24 1710 0 157.4 null
2 2009-06-24 1720 0 234.9 null
2 2009-06-25 1700 1 247.0 0
I added newvar2 because some locations might call several times in a given time window.
When looking for a Stata solution, it is best to provide a data example using dataex (from SSC).
The problem is hard to visualize until the data is sorted by id and time (and further sorted by day). I did not convert the day variable to a Stata numeric date because, as constructed, the string sort order matches the natural date order.
You appear to want, for each call within a group of id time, the date offset in relation to the day of the call. This can be done by generating an order variable to track the index of the current observation within each id time group and then subtracting the index of the observation which makes the call.
Since you can have more than one call per time slot, you have to loop over the maximum number of calls in any given time slot in the data.
There's one difference with the results generated by this solution compared to yours: you seem to ignore the call on 2009-06-27 at 1710 for id == 2.
In the following example, the original data is presented sorted by id time day to provide the readers with a better intuition of what's going on.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 day int time byte called float distance str4 newvar byte newvar2
1 "2009-06-24" 1700 0 1037.6 "null" .
1 "2009-06-25" 1700 0 526.7 "null" .
1 "2009-06-26" 1700 0 428.3 "null" .
1 "2009-06-27" 1700 0 282.5 "null" .
1 "2009-06-24" 1710 1 1191.9 "0" -2
1 "2009-06-25" 1710 0 342.5 "1" -1
1 "2009-06-26" 1710 1 240.7 "2" 0
1 "2009-06-27" 1710 0 182.1 "3" 1
1 "2009-06-24" 1720 0 165.5 "-1" .
1 "2009-06-25" 1720 1 416.1 "0" .
1 "2009-06-26" 1720 0 228.7 "1" .
1 "2009-06-27" 1720 0 195.5 "2" .
2 "2009-06-24" 1700 0 198 "-1" .
2 "2009-06-25" 1700 1 247 "0" .
2 "2009-06-26" 1700 0 188.7 "1" .
2 "2009-06-27" 1700 0 203.5 "2" .
2 "2009-06-24" 1710 0 157.4 "null" .
2 "2009-06-25" 1710 0 221.3 "null" .
2 "2009-06-26" 1710 0 283.8 "null" .
2 "2009-06-27" 1710 1 91.7 "null" .
2 "2009-06-24" 1720 0 234.9 "null" .
2 "2009-06-25" 1720 0 249.6 "null" .
2 "2009-06-26" 1720 0 279.7 "null" .
2 "2009-06-27" 1720 0 198.2 "null" .
3 "2009-06-24" 1700 0 156.1 "-1" .
3 "2009-06-25" 1700 1 19.9 "0" .
3 "2009-06-26" 1700 0 195.2 "1" .
3 "2009-06-27" 1700 0 306.2 "2" .
3 "2009-06-24" 1710 0 150.1 "null" .
3 "2009-06-25" 1710 0 163.7 "null" .
3 "2009-06-26" 1710 0 288.2 "null" .
3 "2009-06-27" 1710 0 311.7 "null" .
3 "2009-06-24" 1720 0 135.1 "-2" .
3 "2009-06-25" 1720 0 186 "-1" .
3 "2009-06-26" 1720 1 297.2 "0" .
3 "2009-06-27" 1720 0 375.9 "1" .
end
* order observations by date within an id time group
sort id time day
by id time: gen order = _n
* running count of calls within each id time group
by id time: gen call = sum(called)
* repeat enough to cover the max number of calls per time
sum call, meanonly
local n = r(max)
forvalues i = 1/`n' {
// the index of the called observation in the id time group
by id time: gen index = order if called & call == `i'
// replicate the index for all observations in the id time group
by id time: egen gindex = total(index)
// the relative position of each obs in groups with a call
gen wanted`i' = order - gindex if gindex > 0
drop index gindex
}
list, sepby(id time) noobs compress
and the results
. list, sepby(id time) noobs compress
+----------------------------------------------------------------------------------------+
| id day time cal~d dist~e new~r new~2 order call wan~1 wan~2 |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1700 0 1037.6 null . 1 0 . . |
| 1 2009-06-25 1700 0 526.7 null . 2 0 . . |
| 1 2009-06-26 1700 0 428.3 null . 3 0 . . |
| 1 2009-06-27 1700 0 282.5 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1710 1 1191.9 0 -2 1 1 0 -2 |
| 1 2009-06-25 1710 0 342.5 1 -1 2 1 1 -1 |
| 1 2009-06-26 1710 1 240.7 2 0 3 2 2 0 |
| 1 2009-06-27 1710 0 182.1 3 1 4 2 3 1 |
|----------------------------------------------------------------------------------------|
| 1 2009-06-24 1720 0 165.5 -1 . 1 0 -1 . |
| 1 2009-06-25 1720 1 416.1 0 . 2 1 0 . |
| 1 2009-06-26 1720 0 228.7 1 . 3 1 1 . |
| 1 2009-06-27 1720 0 195.5 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1700 0 198 -1 . 1 0 -1 . |
| 2 2009-06-25 1700 1 247 0 . 2 1 0 . |
| 2 2009-06-26 1700 0 188.7 1 . 3 1 1 . |
| 2 2009-06-27 1700 0 203.5 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1710 0 157.4 null . 1 0 -3 . |
| 2 2009-06-25 1710 0 221.3 null . 2 0 -2 . |
| 2 2009-06-26 1710 0 283.8 null . 3 0 -1 . |
| 2 2009-06-27 1710 1 91.7 null . 4 1 0 . |
|----------------------------------------------------------------------------------------|
| 2 2009-06-24 1720 0 234.9 null . 1 0 . . |
| 2 2009-06-25 1720 0 249.6 null . 2 0 . . |
| 2 2009-06-26 1720 0 279.7 null . 3 0 . . |
| 2 2009-06-27 1720 0 198.2 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1700 0 156.1 -1 . 1 0 -1 . |
| 3 2009-06-25 1700 1 19.9 0 . 2 1 0 . |
| 3 2009-06-26 1700 0 195.2 1 . 3 1 1 . |
| 3 2009-06-27 1700 0 306.2 2 . 4 1 2 . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1710 0 150.1 null . 1 0 . . |
| 3 2009-06-25 1710 0 163.7 null . 2 0 . . |
| 3 2009-06-26 1710 0 288.2 null . 3 0 . . |
| 3 2009-06-27 1710 0 311.7 null . 4 0 . . |
|----------------------------------------------------------------------------------------|
| 3 2009-06-24 1720 0 135.1 -2 . 1 0 -2 . |
| 3 2009-06-25 1720 0 186 -1 . 2 0 -1 . |
| 3 2009-06-26 1720 1 297.2 0 . 3 1 0 . |
| 3 2009-06-27 1720 0 375.9 1 . 4 1 1 . |
+----------------------------------------------------------------------------------------+
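The question also asks, as a first step, to drop all id time groups in which no call ever occurs. That step is not shown above, but here is a minimal sketch, to be run after the sort step (it assumes the data are still sorted by id time day):
* Hedged sketch: flag id time groups with at least one call,
* then drop the groups that never call
by id time: egen anycall = max(called)
drop if anycall == 0
drop anycall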

computing the average of variables with missing values in Stata

I know how to calculate the average of variables without missing values, but I am not sure how to calculate it with missing values. For example, we have 6 area halls as follows:
area_hall_1 area_hall_2 area_hall_3 area_hall_4 area_hall_5 area_hall_6
580 580 650 . . .
1000 1000 . . .
825 825 . . . .
912 912 . . . .
670 . . . . .
790 . . . . .
750 900 1000 1000 900 750
The reported (or rather implied) problem makes no sense whatsoever. Consider the data posted (an extra missing value is needed in the second observation).
. clear
. input area_hall_1 area_hall_2 area_hall_3 area_hall_4 area_hall_5 area_hall_6
area_ha~1 area_ha~2 area_ha~3 area_ha~4 area_ha~5 area_ha~6
1. 580 580 650 . . .
2. 1000 1000 . . . .
3. 825 825 . . . .
4. 912 912 . . . .
5. 670 . . . . .
6. 790 . . . . .
7. 750 900 1000 1000 900 750
8. end
. egen area_hall_mean = rowmean(area_hall_?)
. egen area_hall_count = rownonmiss(area_hall_?)
. l *_mean *_count , sep(0)
+---------------------+
| area_h~n area_h~t |
|---------------------|
1. | 603.3333 3 |
2. | 1000 2 |
3. | 825 2 |
4. | 912 2 |
5. | 670 1 |
6. | 790 1 |
7. | 883.3333 6 |
+---------------------+
. di (580+580+650)/3
603.33333
The egen function rowmean() ignores missing values. How could it do otherwise? The only other possibility is to report that a mean cannot be calculated because there are missing values. That is defensible, but not at all typical Stata style. So the means reported are exactly those the OP wants. An independent calculation with display shows that the means reported are those desired. (A profound sceptic is at liberty to inspect the code with viewsource _growmean.ado.)
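As a further cross-check, the row mean can be rebuilt by hand from rowtotal(), which treats missing values as zero, together with the nonmissing count already generated above; a minimal sketch:
* Hedged sketch: recompute the row mean manually and list both versions
egen area_hall_sum = rowtotal(area_hall_?)
gen area_hall_mean2 = area_hall_sum / area_hall_count if area_hall_count > 0
list area_hall_mean area_hall_mean2, sep(0)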

Dividing complex shapes into contiguous sub-shapes in MATLAB

I have a 3D shape loaded into MATLAB as a 3D matrix. The matrix is fairly large, e.g. 250x250x250. The shape is defined within the matrix by numbers >0 but <=1, so all positive numbers in the matrix are "shape", and all zeros are "non-shape". The shape is contiguous. A simplified (8x8) example of one plane of such a shape is shown in below:
0 0 0 0 0 0 0 0
0 0 1 .5 .1 .2 1 0
0 0 0 0 0 .3 0 0
0 0 .2 .3 1 1 1 1
0 0 0 .8 1 0 0 0
0 .2 .1 1 0 1 0 0
0 .1 .9 .9 .9 0 0 0
0 0 0 0 0 0 0 0
I need to split this shape into 2 sub-shapes where the sum of values of the two sub-shapes is roughly equal, and where the two sub-shapes are contiguous. So a valid division could be [N.B. zeros replaced by '.' for visual clarity]:
. . . . . . . .
. . B B B B B .
. . . . . B . .
. . A A B B B B
. . . A A . . .
. A A A . A . .
. A A A A . . .
. . . . . . . .
But the following division would be invalid because not all of the values in sub-shape B can be directly joined up with each other.
. . . . . . . .
. . B B B A A .
. . . . . A . .
. . B B A A A A
. . . B A . . .
. B B B . A . .
. B B B B . . .
. . . . . . . .
My real-world example is in 3 dimensions and much larger. Any ideas how I could divide my shape into 2 contiguous sub-shapes? By extension, how could I divide it into 3 contiguous sub-shapes if I wanted to, again where the sum of values in the sub-shapes is approximately equal?

Removing lines in one file that are present in another file

I have 2 .vcf files with genomic data and I want to remove lines in the 1st file that are also present in the second file. The code I have so far seems to iterate only once, removing the first hit and then stopping the search. Any help would be very appreciated since I cannot figure out where the problem is. Sorry for any mis-code...
I opted for arrays of arrays instead of hashes because the initial order of the file must be maintained, and I think that this is better achieved with arrays...
Code:
#!/usr/bin/perl
use strict;
use warnings;
## bring files to program
MAIN: {
my ($chrC, $posC, $junkC, $baserefC, $allrestC);
my (@ref_arrayH, @ref_arrayC);
my ($chrH, $posH, $baserefH);
my $entriesH;
# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
if (defined) {
chomp;
if (/^#/) { next; }
if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
my @line_arrayH = split(/\t/, $_);
push (@ref_arrayH, \@line_arrayH); # the "reference" of each line_array is now in each element of @ref_arrayH
$entriesH = scalar @ref_arrayH; # count the number of entries in the het file
}
}
}
close HET;
# print $entriesH,"\n";
open (CNS, $cns) or die "Could not open file $cns - $!";
foreach my $refH ( @ref_arrayH ) {
$chrH = $refH -> [0];
$posH = $refH -> [1];
$baserefH = $refH -> [3];
foreach my $line (<CNS>) {
chomp $line;
if ($line =~ /^#/) { next; }
if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
else { print "$line\n"; }
}
}
}
# close CNS;
}
CNS file:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
HET file:
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
I would like the output to be like this
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
but is giving me this instead:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
Why is this nested loop not working properly? If I want to keep this structure of arrays of arrays, why is it only doing the iteration the first time?
Would it be better to change the foreach loop
foreach my $refH ( @ref_arrayH ) {
with a for loop
for (my $i = 0; $i <= $entriesH; $i++) {
?
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;
open my $cns, '<', 't.cns' or die $!;
while (my $line = <$cns>) {
next if exists $seen{ $line };
print $line;
}
close $cns or die $!;
If you want to match something other than entire lines, extract the field(s) and use it (or their combination) to key into the %seen hash.
This will use memory in proportion to the number of lines in the HET file.
To reduce memory usage, you can tie %seen to a DBM_File on disk.
If you are concerned about memory usage, you can read both files one line at a time while doing the comparison. The following is an alternative approach.
Note: because of the way filehandles work, we have to reopen the HET file every time we need to read through it in the inner loop.
#!/usr/bin/env perl
use strict;
use warnings;
open my $cns, '<', 't.cns' or die $!;
CNS:
while (my $line = <$cns>) { #read one line at a time from t.cns file.
open my $het, '<', 't.het' or die $!;
while (my $reference = <$het>){
if ($line eq $reference) { # test whether the current t.cns line equals any line in the t.het file.
close $het; # reset the connection to the t.het file.
next CNS; # skip current t.cns line.
}
}
print $line;
close $het; # reset the connection to the t.het file.
}
close $cns or die $!;