I have 10 date variables in Stata for each ID, in wide format. I want to check the difference between each pair of consecutive dates: that difference shouldn't be less than 8 days.
These types of tasks are generally easier with the panel data in long format:
clear
/* clean up data */
input str20(id visit1 visit2 visit3 visit4)
001 "2/3/2014" "2/5/2014" "2/7/2014" "2/10/2014"
002 "2/3/2014" "2/5/2014" "2/7/2014" "2/10/2014"
004 "2/19/2014" "2/21/2014" "2/24/2014" "2/26/2014"
005 "2/28/2014" "3/4/2014" "3/6/2014" "3/10/2014"
008 "3/14/2014" "3/18/2014" "3/20/2014" "3/25/2014"
end
foreach var of varlist visit* {
gen t = date(`var',"MDY")
format t %td
drop `var'
rename t `var'
}
/* reshape to long format */
destring id, gen(non_str_id)
reshape long visit, i(non_str_id) j(t)
xtset non_str_id t
/* every gap after the first visit must be at least 8 days and non-missing */
assert D.visit >= 8 & !missing(D.visit) if t > 1
/* reshape back to wide after fixing problematic data */
drop non_str_id
reshape wide visit, i(id) j(t)
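As a language-neutral cross-check of the rule being asserted (each consecutive gap must be at least 8 days), here is a minimal Python sketch. The function name short_gaps and the constant MIN_GAP_DAYS are made up for illustration; the dates are the row for id 005 from the example data.

```python
from datetime import datetime

MIN_GAP_DAYS = 8  # threshold taken from the question

def short_gaps(visits, min_gap=MIN_GAP_DAYS):
    """Return (visit_number, gap_in_days) for each visit that follows
    the previous one by fewer than min_gap days."""
    dates = [datetime.strptime(v, "%m/%d/%Y").date() for v in visits]
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    # i + 2 is the 1-based number of the later visit in each pair
    return [(i + 2, g) for i, g in enumerate(gaps) if g < min_gap]

# row for id 005 in the example data
row_005 = ["2/28/2014", "3/4/2014", "3/6/2014", "3/10/2014"]
violations = short_gaps(row_005)
print(violations)
```

Every pair in that row is flagged, which is the same situation in which the Stata assert stops with an error.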
This is how I sorted the data and created the lag variables:
tsset permno date, monthly
sort permno date
by permno: gen lagret1=ret[_n-1]
by permno: gen lagret2=ret[_n-2]
by permno: gen lagret3=ret[_n-3]
by permno: gen lagret4=ret[_n-4]
by permno: gen lagret5=ret[_n-5]
I don't know the rest.
*Step 1: Upload the data and create key variables
*Upload the dataset that contains CRSP information and create key variables.
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/CRSPforMOM.dta", clear
*Keep only common stock
keep if shrcd == 10 | shrcd == 11
*Create monthindex variable
gen monthindex = year(date)*12+month(date)
*Create past 5 months of returns as lags within each permno
*tsset tells Stata the panel structure of the data; the lags below use
*explicit subscripts ([_n-k]), which assume no gaps in each permno's months
tsset permno date, monthly
sort permno date
by permno: gen lagret1=ret[_n-1]
by permno: gen lagret2=ret[_n-2]
by permno: gen lagret3=ret[_n-3]
by permno: gen lagret4=ret[_n-4]
by permno: gen lagret5=ret[_n-5]
*Create a variable that captures the cumulative return of stock i
*from month -5 through the current month
*Compounding requires multiplying consecutive gross returns
gen cumret6 = (1+ret)*(1+lagret1)*(1+lagret2)*(1+lagret3)*(1+lagret4)*(1+lagret5)
*Save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta", replace
*Step 2: Create and apply filters
*Before allocating stocks to portfolios, we should create and apply filters
*Select only NYSE stocks and find the 10th percentile of NYSE size in each month
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta", clear
*Keep only NYSE stocks
keep if exchcd == 1
*Keep if market cap is larger than 0
keep if mktcap >0
*Drop missing observations where marketcap is missing
drop if missing(mktcap)
*Since we create portfolios monthly, we need breakpoints monthly
sort date
by date: egen p10=pctile(mktcap), p(10)
*We only need date variable (for merging) and p10 variable (as a filter),
*so we drop everything else
keep date p10
*Drop duplicates so that p10 repeats once for every month in the sample
duplicates drop
*save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOMNYSEBreakpoints.dta", replace
*Merge the breakpoints into the dataset created in step 1,
*so that we can remove small firms
*Break points are date specific so merge on date
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta", clear
sort date
merge m:1 date using "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOMNYSEBreakpoints.dta"
*merge==3 indicates that an observation is present in both
*master and using datasets, that is the only data that is properly merged
*and the only data that should be kept
keep if _merge==3
*We need to drop _merge variable to be able to merge data again
drop _merge
*Apply filters, i.e. remove small firms and firms priced below $5
drop if missing(mktcap)
drop if mktcap<=p10
*use absolute value because CRSP denotes BID-ASK midpoint with negative sign
drop if abs(prc)<5
*Save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM2.dta", replace
*Step 3: Allocate stocks in 10 portfolios and hold for 6 months
*Use new file
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM2.dta", clear
sort date
*We will create variable prret6, which tells us which portfolio a stock
*belongs to based on cumret6
*We use xtile(), which puts a prespecified percentage of firms into
*each portfolio
*nq() tells Stata how many portfolios we want
*Note: xtile() is not an official egen function; it comes from the
*user-written egenmore package (ssc install egenmore)
by date: egen prret6 = xtile(cumret6), nq(10) // takes ~20min to run
*Save
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM3.dta", replace
*Use the portfolios
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM3.dta", clear
drop if missing(prret6)
*Expand data, i.e. create 6 copies of the data
expand 6
sort permno date
*Create variable n, which tracks which copy of the data it is;
*n will go from 1 to 6
*_n is the observation number within each by-group
by permno date: gen n=_n
*Use n to shift monthindex forward by n months (the 6 holding months)
replace monthindex = monthindex+n
sort permno monthindex
*Drop return from the master dataset because we want the one from the
*using dataset
drop ret
merge m:1 permno monthindex using "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM1.dta"
keep if _merge==3
drop _merge
save "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM4.dta", replace
*Step 4: Analysis
use "/Users/dk/Desktop/USD Documents/MSF/MFIN 518/MOM4.dta", clear
sort monthindex prret6 date
*Average returns based on each portfolio in each calendar month and by
*formation month
collapse (mean) ret, by (monthindex prret6 date)
*Summarize again to get average portfolio returns by calendar month (monthindex)
collapse (mean) ret, by (monthindex prret6)
*Transpose the data
reshape wide ret, i(monthindex) j(prret6) // i(rows) j(columns)
*Generate year and month variable for clarity
*monthindex = year*12 + month, with month running from 1 to 12
gen year = floor((monthindex-1)/12)
gen month = monthindex - year*12
*create momentum return variable and check for significance
gen momret=ret10-ret1
ttest ret10==ret1
*testing momentum returns from year 2000 onward
*(January 2000 has monthindex 2000*12+1 = 24001)
keep if monthindex>=24001
ttest ret10==ret1
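The do-file defines monthindex as year(date)*12 + month(date), so recovering year and month from it needs care at the December boundary. The following Python sketch only checks that arithmetic; the function names are hypothetical and it is not part of the Stata workflow.

```python
def to_monthindex(year, month):
    """Same definition as in the do-file: year*12 + month, month in 1..12."""
    return year * 12 + month

def from_monthindex(monthindex):
    """Invert to_monthindex; subtracting 1 first keeps December in its own year."""
    year = (monthindex - 1) // 12
    month = monthindex - year * 12
    return year, month

# 24000 is December 1999, so "year 2000 onward" starts at 24001
print(from_monthindex(24000))
print(from_monthindex(24001))
```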
Hello currently my dates are stored as numeric in the form of 40547. How can I convert these to MMDDYY10.?
data SevenSec11;
set Seven11;
DateRecieved = input(put(DateRecieved, 8.), MMDDYY10.);
format DateRecieved MMDDYY10.;
run;
How to convert it depends on what the value represents. If the values are dates as stored by Excel, shift them by the offset between the Excel and SAS date origins. If they are supposed to represent MMDDYY values, use the Z6. format in your PUT() function call so that leading zeros are kept.
data test;
input num ;
sasdate1 = num + '30DEC1899'd ;
sasdate2 = input(put(num ,z6.),mmddyy10.);
format num comma7. sasdate: yymmdd10. ;
cards;
40547
;
Result:
Obs num sasdate1 sasdate2
1 40,547 2011-01-04 1947-04-05
Note that using Y-M-D order for dates will eliminate confusion that truncated leading zeros can cause. It will also prevent half of your audience from confusing April 5th with May 4th.
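To double-check the first interpretation outside SAS, here is a small Python sketch of the same offset arithmetic. It assumes the usual Excel (1900 date system) origin of 1899-12-30, which matches the '30DEC1899'd literal above and holds for serial numbers after February 1900; the names are illustrative only.

```python
from datetime import date, timedelta

EXCEL_ORIGIN = date(1899, 12, 30)  # assumed Excel 1900-system origin
SAS_ORIGIN = date(1960, 1, 1)      # SAS dates count days from here

def excel_serial_to_date(serial):
    """Interpret an integer Excel serial number as a calendar date."""
    return EXCEL_ORIGIN + timedelta(days=serial)

d = excel_serial_to_date(40547)
sas_value = (d - SAS_ORIGIN).days          # the number SAS stores for that date
offset = (EXCEL_ORIGIN - SAS_ORIGIN).days  # what '30DEC1899'd evaluates to
print(d.isoformat(), sas_value, offset)
```

The date agrees with sasdate1 in the result shown above.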
I have some ECG data for a number of subjects. For each subject, I can export an Excel file with the RR interval, heart rate and other measures. The problem is that the timestamp starts at the time of recording (in this case 11:22:03:000).
I need to compare the data with other subjects, and I want to automate the procedure in MATLAB.
I need to flexibly compare, for instance, the first 3 minutes of subjects in condition 1 with those of subjects in condition 2, or minutes 4 to 8 of conditions 1 and 2, and so forth. To do this, I think the best way is to shift the time vector for each subject so that it starts from 0.
There are a couple of problems to note: I CANNOT create just one vector for all subjects. This would be inaccurate because the heart measures are variable for each individual.
So, IN SHORT I need to shift the time vector for each participant so that it starts at 0 and increases exactly like the original one. So, in this example:
H: M: S: MS RR HR
11:22:03:000 0.809 74.1
11:22:03:092 0.803 74.7
11:22:03:895 0.768 78.1
11:22:04:663 0.732 81.9
11:22:05:395 0.715 83.9
11:22:06:110 0.693 86.5
11:22:06:803 0.705 85.1
11:22:07:508 0.706 84.9
11:22:08:214 0.749 80.1
11:22:08:963 0.762 78.7
11:22:09:725 0.766 78.3
would become:
00:00:00:000
00:00:00:092
00:00:00:895
00:00:01:663
and so forth...
I would like to do it in Matlab...
P.S.
I was working around the idea of extracting the info in 4 different variables.
Then, I could subtract the values for each cell from the first cell.
For instance:
11-11 = 0; 22-22=0; 03-03=0; ms: keep the same value
Maybe this could kind of work, except that it wouldn't if I have a subject that started, say, at 11:55:05:00
Thank you all for any help.
Gluce
Basic timestamp normalization just subtracts the minimum (or first, assuming they're properly ordered) time from the rest.
With MATLAB's datetime object, this is just subtraction, which yields a duration object:
ts = ["11:22:03:000", "11:22:03:092", "11:22:03:895", "11:22:04:663"];
% Convert to datetime & normalize
t = datetime(ts, 'InputFormat', 'HH:mm:ss:SSS');
t.Format = 'HH:mm:ss:SSS';
nt = t - t(1);
% Reformat & display
nt.Format = 'hh:mm:ss.SSS';
Which returns:
>> nt
nt =
1×4 duration array
00:00:00.000 00:00:00.092 00:00:00.895 00:00:01.663
Alternatively, you can normalize the datetime array itself:
ts = ["11:22:03:000", "11:22:03:092", "11:22:03:895", "11:22:04:663"];
t = datetime(ts, 'InputFormat', 'HH:mm:ss:SSS');
t.Format = 'HH:mm:ss:SSS';
[h, m, s] = hms(t);
[t.Hour, t.Minute, t.Second] = deal(h - h(1), m - m(1), s - s(1));
Which returns the same:
>> t
t =
1×4 datetime array
00:00:00:000 00:00:00:092 00:00:00:895 00:00:01:663
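The same "subtract the first timestamp" idea can be sketched in Python as a quick sanity check, including a start time like the 11:55:05 case raised in the question. This is only an illustration: the helper names are made up, and it assumes timestamps in H:M:S:MS form that are in order and do not cross midnight.

```python
from datetime import timedelta

def parse_stamp(stamp):
    """Turn 'HH:MM:SS:mmm' into a time-of-day offset."""
    h, m, s, ms = (int(part) for part in stamp.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)

def normalize(stamps):
    """Shift all timestamps so that the first one becomes zero."""
    times = [parse_stamp(s) for s in stamps]
    return [t - times[0] for t in times]

def fmt(delta):
    """Format an elapsed time back into 'HH:MM:SS:mmm'."""
    total_ms = delta // timedelta(milliseconds=1)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}:{ms:03d}"

stamps = ["11:22:03:000", "11:22:03:092", "11:22:03:895", "11:22:04:663"]
shifted = [fmt(d) for d in normalize(stamps)]
print(shifted)
```

Because the whole timestamp is subtracted at once, minute and hour rollovers need no special handling.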
I am trying to maintain a table using some panel data. I have all the data outputting fine, but I am having difficulty getting the correct dates to display. The method I am using is the following:
gen ymdny = date(date,"MDY"); /*<- date var from panel dataset that i import*/
sort name ymdny;
summ ymdny;
local lastdate : disp %tdM-D r(max);
local lastdate2 : disp %tdM-D (r(max)-1);
local lastw : disp %tdM-D (r(max)-7);
This would work fine if the data were daily, but the dataset I have is actually business-daily (i.e. missing on weekends and national bank holidays). It seems silly, but I have not been able to figure out a workaround that does the job. Ideally, there is a function that I can use to print the date corresponding to a particular value.
For example:
gen resbal_1d = round(l1.resbal,0.1);
gen dateOf = dateOf(resbal_1d); /* <- pseudocode example of what I would like */
I'm not sure what you're asking for, but my guess is that you want to see a human-readable date as the output, given a numerical input (this is your last sentence). So simply try something like:
display %td 10
The format is important as the following shows (see help format):
display %tq 10
Same numerical input, different format, different output.
Two other examples from the manual:
* string to integer
display date("5-12-1998", "MDY")
* string to date format
display %td date("5-12-1998", "MDY")
As for your example code, I don't get what you're aiming for. You can indeed summarize the date variable because, in Stata, dates are just integers. It's legal, but I couldn't say if it's good form. Below is a simple example.
clear all
set more off
set obs 10
gen date = _n // create the data
format date %td // give date format
list
summarize date
local onedate = r(max)
display %td `onedate'
Some references:
[U] 24 Working with dates and time
help datetime
help datetime business calendars
http://www.stata.com/support/faqs/data-management/creating-date-variables/
http://www.ats.ucla.edu/stat/stata/modules/dates.htm
(Maybe you can explain with more detail and context what it is you want.)
Edit
Your comment
I do not see how this helps with the date output. For example,
displaying r(max) - 1 on a monday will still display the sunday date.
does not explain, at all, the problems you're having with Stata's business calendars.
I'm adding what is basically an example taken from the help file I already referenced. I do this with the hope of convincing you that (re)-reading the help files is worthwhile.
clear all
set more off
* import string dates
infile str10 sdate float x using http://www.stata-press.com/data/r13/bcal_simple
list
*----- Regular dates -----
* create elapsed dates - Stata's way of managing dates
generate rdate = date(sdate, "MD20Y")
format rdate %td
drop sdate x
list
* compute previous and next dates
generate tomorrow1 = rdate + 1
format tomorrow1 %td
generate yesterday1 = rdate - 1
format yesterday1 %td
list
*----- Business dates -----
* convert regular date to business dates
generate bdate = bofd("simple", rdate)
format bdate %tbsimple
* compute previous and next dates
generate tomorrow2 = bdate + 1
format tomorrow2 %tbsimple
generate yesterday2 = bdate - 1
format yesterday2 %tbsimple
order yesterday1 rdate tomorrow1 yesterday2 bdate tomorrow2
list
/*
The stbcal-file for simple defines the calendar shown below
(X marks a non-business day):

         November 2011
  Su  Mo  Tu  We  Th  Fr  Sa
----------------------------
           1   2   3   4   X
   X   7   8   9  10  11   X
   X  14  15  16  17  18   X
   X  21  22  23   X   X   X
   X  28  29  30
----------------------------
*/
Notice that if you add or subtract 1 from a regular date, business days are not taken into account. If you do the same with a business calendar date, you get what you want. Business calendars are defined by .stbcal files; the example uses a built-in calendar called simple. You may need to make your own .stbcal file, but it is not difficult. Again, the details are in the help files.
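To make the difference concrete outside Stata, here is a rough Python sketch of "previous business day" arithmetic. It is not how Stata implements business calendars; the HOLIDAYS set is a hypothetical stand-in that mirrors the November 2011 calendar shown above (weekends plus November 24 and 25), and the function names are made up.

```python
from datetime import date, timedelta

# Hypothetical holiday list mirroring the "simple" calendar above
HOLIDAYS = {date(2011, 11, 24), date(2011, 11, 25)}

def is_business_day(d):
    """Monday to Friday and not a listed holiday."""
    return d.weekday() < 5 and d not in HOLIDAYS

def shift_business_days(d, n):
    """Move n business days forward (n > 0) or backward (n < 0)."""
    step = 1 if n > 0 else -1
    remaining = abs(n)
    while remaining:
        d += timedelta(days=step)
        if is_business_day(d):
            remaining -= 1
    return d

monday = date(2011, 11, 28)
calendar_yesterday = monday - timedelta(days=1)     # a Sunday
business_yesterday = shift_business_days(monday, -1)  # skips weekend and holidays
print(calendar_yesterday, business_yesterday)
```

Subtracting one calendar day from Monday, November 28 lands on a Sunday, while stepping back one business day lands on Wednesday, November 23.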
So, I'm beginning to use timeseries in MATLAB and I'm kinda stuck.
I have a list of timestamps of events which I imported into MATLAB. It's now a 3000x25 array which looks like
2000-01-01T00:01:01+00:00
2000-01-01T00:01:02+00:00
2000-01-01T00:01:03+00:00
2000-01-01T00:01:04+00:00
As you can see, each event was recorded by date, hour, minute, second, etc.
Now, I would like to count the number of events by date, hour, etc. and then do various analyses (regression, etc.).
I considered creating a timeseries object for each day, but considering the size of the data, that's not practical.
Is there any way to manipulate this array such that we have "date: # of events"?
Perhaps there's just a simpler way to count events using timeseries?
As others have suggested, you should convert the string dates to serial date numbers. This makes it easy to work with the numeric data.
An efficient way to count number of events per interval (days, hours, minutes, etc...) is to use functions like HISTC and ACCUMARRAY. The process will involve manipulating the serial dates into units/format required by such functions (for example ACCUMARRAY requires integers, whereas HISTC needs to be given the bin edges to specify the ranges).
Here is a vectorized (loop-free) solution that uses ACCUMARRAY to count the number of events. This is a very efficient function, even on large inputs. At the beginning I generate some sample data of 5000 timestamps unevenly spaced over a period of 4 days. You will obviously want to replace it with your own:
%# let's generate some random timestamps between two points (unevenly spaced)
%# 5000 timestamps over a period of 4 days
dStart = datenum('2000-01-01'); % inclusive
dEnd = datenum('2000-01-05'); % exclusive
t = sort(dStart + (dEnd-dStart).*rand(5000,1));
%#disp( datestr(t) )
%# shift values, by using dStart as reference point
dRange = (dEnd-dStart);
tt = t - dStart;
%# number of events by day/hour/minute
numEventsDays = accumarray(fix(tt)+1, 1, [dRange*1 1]);
numEventsHours = accumarray(fix(tt*24)+1, 1, [dRange*24 1]);
numEventsMinutes = accumarray(fix(tt*24*60)+1, 1, [dRange*24*60 1]);
%# corresponding datetime range/interval label
days = cellstr(datestr(dStart:1:dEnd-1));
hours = cellstr(datestr(dStart:1/24:dEnd-1/24));
minutes = cellstr(datestr(dStart:1/24/60:dEnd-1/24/60));
%# display results
[days num2cell(numEventsDays)]
[hours num2cell(numEventsHours)]
[minutes num2cell(numEventsMinutes)]
Here is the output for the number of events per day:
'01-Jan-2000' [1271]
'02-Jan-2000' [1258]
'03-Jan-2000' [1243]
'04-Jan-2000' [1228]
And an extract of the number of events per hour:
'02-Jan-2000 09:00:00' [50]
'02-Jan-2000 10:00:00' [54]
'02-Jan-2000 11:00:00' [53]
'02-Jan-2000 12:00:00' [74]
'02-Jan-2000 13:00:00' [49]
'02-Jan-2000 14:00:00' [59]
similarly for minutes:
'03-Jan-2000 08:54:00' [1]
'03-Jan-2000 08:55:00' [1]
'03-Jan-2000 08:56:00' [1]
'03-Jan-2000 08:57:00' [0]
'03-Jan-2000 08:58:00' [0]
'03-Jan-2000 08:59:00' [0]
'03-Jan-2000 09:00:00' [1]
'03-Jan-2000 09:01:00' [2]
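For comparison, the same "date: # of events" tally can be sketched in Python with a counter keyed on the truncated timestamp. This is only an illustration of the binning idea, not a replacement for the ACCUMARRAY code: the function name is made up, the last two sample timestamps are invented, and unlike the approach above, intervals with no events simply do not appear in the result.

```python
from collections import Counter
from datetime import datetime

def count_events(stamps, bin_format):
    """Count ISO 8601 timestamps per interval; bin_format sets the granularity."""
    return Counter(datetime.fromisoformat(s).strftime(bin_format) for s in stamps)

stamps = [
    "2000-01-01T00:01:01+00:00",
    "2000-01-01T00:01:02+00:00",
    "2000-01-01T01:30:00+00:00",  # invented sample value
    "2000-01-02T00:00:00+00:00",  # invented sample value
]
per_day = count_events(stamps, "%Y-%m-%d")
per_hour = count_events(stamps, "%Y-%m-%d %H:00")
print(per_day)
print(per_hour)
```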
You can convert those timestamps to a number with datenum:
A serial date number represents the whole and fractional number of days from a specific date and time, where datenum('Jan-1-0000 00:00:00') returns the number 1. (The year 0000 is merely a reference point and is not intended to be interpreted as a real year in time.)
This way, it's easier to check where a period starts and ends. E.g., the week you're looking for starts at x and ends just before x+7; all you have to do to find events in that period is check whether the datenum value is at least x and less than x+7:
week_x_events = find(dn_timestamp>=x & dn_timestamp<x+7)
The difficulty is in converting your timestamp to datenum acceptable format, which is doable using regexp, good luck!
I don't know what +00:00 means (maybe time zone?), but you can simply convert your string timestamps into numerical format:
>> t = datenum('2000-01-01T00:01:04+00:00', 'yyyy-mm-ddTHH:MM:SS')
t =
7.3049e+005
>> datestr(t)
ans =
01-Jan-2000 00:01:04