I have a text file, which contains only numbers.
For example:
2001 31110
199910 311
Its layout can be explained as follows:
1~4th numbers : Year
5~6th numbers : Month
7~8th numbers : Day
9th number : Sex
10th number : Married
However, I can't decide how to import this file into Stata.
For instance, if I use the command:
import delimited input.txt, delimiter(??)
What should I write in delimiter?
I don't necessarily need to use the above. I just want to import the data using whatever method.
The answer depends on what you want to do with the data later.
My understanding is that a space stands in for the missing digit of a single-digit month or day, and that in the text file only one of month or day can be single-digit, not both. In addition, sex and married are binary indicators taking values 0 and 1.
Assuming the above is correct and the data below are stored in a file data.txt:
2001 31110
199910 311
1983 41201
2012121500
Here's one way to do it:
clear
import delimited data.txt, delimiter(" ") stringcols(_all)
list
+--------------------+
| v1 v2 |
|--------------------|
1. | 2001 31110 |
2. | 199910 311 |
3. | 1983 41201 |
4. | 2012121500 |
+--------------------+
replace v2 = "0" + v2 if v2 != ""
generate v3 = v1 + v2
generate year = substr(v3, 1, 4)
generate month = substr(v3, 5, 2)
generate day = substr(v3, 7, 2)
generate date = substr(v3, 1, 8)
generate sex = substr(v3, 9, 1)
generate married = substr(v3, 10, 1)
list
+----------------------------------------------------------------------------------+
| v1 v2 v3 year month day date sex married |
|----------------------------------------------------------------------------------|
1. | 2001 031110 2001031110 2001 03 11 20010311 1 0 |
2. | 199910 0311 1999100311 1999 10 03 19991003 1 1 |
3. | 1983 041201 1983041201 1983 04 12 19830412 0 1 |
4. | 2012121500 2012121500 2012 12 15 20121215 0 0 |
+----------------------------------------------------------------------------------+
You basically import everything into at most two string variables, with a single space " " acting as the separator. Single-digit months or days are padded to two digits by prepending a 0. Then, after you extract the relevant parts of the strings using the substr() function, you can simply convert the resulting variables to numeric as needed.
For example:
destring year month day sex married, replace
generate date2 = daily(date, "YMD")
format date2 %tdDD-NN-CCYY
list date2
+------------+
| date2 |
|------------|
1. | 11-03-2001 |
2. | 03-10-1999 |
3. | 12-04-1983 |
4. | 15-12-2012 |
+------------+
If in your text file both month and day contain single digits, you follow the same logic as above but you will need to deal with a third variable as well after you import the data.
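For readers who want to check the parsing logic outside Stata, here is a minimal Python sketch of the same idea. It assumes that every record is exactly 10 characters wide and that a blank always stands in for a leading zero, which is what the sample lines suggest; the function name parse_record is hypothetical. Because it pads each blank in place, it would also cover records where both month and day are single-digit.

```python
from datetime import date


def parse_record(line: str) -> dict:
    """Parse one 10-character record laid out as YYYYMMDD + sex + married."""
    # Assumption: a blank inside the record is a missing leading zero.
    padded = line.rstrip("\n").replace(" ", "0")
    if len(padded) != 10 or not padded.isdigit():
        raise ValueError(f"unexpected record: {line!r}")
    return {
        "date": date(int(padded[0:4]), int(padded[4:6]), int(padded[6:8])),
        "sex": int(padded[8]),
        "married": int(padded[9]),
    }


# The four sample lines from data.txt above.
sample = ["2001 31110", "199910 311", "1983 41201", "2012121500"]
records = [parse_record(line) for line in sample]
```

The parsed dates and indicators should agree with the year, month, day, sex and married columns in the Stata listing above.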
My date data was pulled from our system in the format "mm/dd", without the year.
So I run into a problem when I subtract a date that falls in the previous year from a date in the current year.
Example:
| Date action | Date Check | Result | Current Date |
| ----------- | ---------- | ------ | ------------ |
| 12/21 | 01/03 | -352 | 03/18/2022 |
The correct result is:
| Date action | Date Check | Result | Current Date |
| ----------- | ---------- | ------ | ------------ |
| 12/21 | 01/03 | 13 | 03/18/2022 |
How to subtract correctly? Thanks.
I assume that Excel has treated 12/21 and 01/03 as dates, but in doing so has assumed the current year in all cases. This formula deducts 1 year from the date in cell A2:
=DATE(YEAR(A2)-1,MONTH(A2),DAY(A2))
For example, test whether the month of the earlier date is greater than the month of the later date. If it is, deduct 1 year from the earlier date before calculating the difference; otherwise subtract the dates directly.
+------------+------------+------+
| A | B | C |
1 | earlier | latter | diff |
+------------+------------+------+
2 | 2022-12-21 | 2022-01-03 | 13 |
+------------+------------+------+
in cell C2
=IF(MONTH(A2) > MONTH(B2),B2-DATE(YEAR(A2)-1,MONTH(A2),DAY(A2)),B2-A2)
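The rollover rule can also be sketched in Python to make the arithmetic easy to verify. This is only an illustration under stated assumptions: the dates arrive as "mm/dd" strings, the year of the later date is known, and the two dates are less than a year apart. The function name days_between is hypothetical, and unlike the Excel formula it compares full dates rather than months only, so it also handles two dates within the same month.

```python
from datetime import date


def days_between(earlier_mmdd: str, later_mmdd: str, later_year: int) -> int:
    """Days from the earlier to the later date when only mm/dd is known."""
    m1, d1 = (int(part) for part in earlier_mmdd.split("/"))
    m2, d2 = (int(part) for part in later_mmdd.split("/"))
    later = date(later_year, m2, d2)
    # First assume both dates are in the same year, as Excel does.
    # Note: 02/29 is not handled specially and raises ValueError
    # if the assumed year is not a leap year.
    earlier = date(later_year, m1, d1)
    if earlier > later:
        # The "earlier" date would fall after the later one,
        # so it must belong to the previous year.
        earlier = date(later_year - 1, m1, d1)
    return (later - earlier).days


diff = days_between("12/21", "01/03", 2022)  # 21 Dec 2021 to 3 Jan 2022
```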
I'm trying to create a Google Form in which I have to ask for consumption data for each month in a certain period. I'd like to format the output file like so: 12 rows (Jan, Feb, Mar, Apr, etc.) and 3 columns (F1, F2, F3), and I want to populate these 12x3 fields for every submission.
Now, Google Forms saves each submission in one row, so I'd end up with 12x3 = 36 columns this way. I'd like to know if there's a way to arrange the entered data in a table (12 rows by 3 columns) instead of 36 columns in a single row.
So the result would be:
| name | months | F1 | F2 | F3 |
|:------- |:------:| ---:| ---:| ---:|
| example | Jan | 50 | 50 | 20 |
| example | Feb | 60 | 30 | 10 |
| example | Mar | 50 | 90 | 70 |
| ... | ... | ... | ... | ... |
And the last row would be: example; Dec; number1, number2, number3
Thanks in advance
I have a dataset with a column for Date that looks like this:
| Date | Another column |
| -------- | -------------- |
| 1.2019 | row1 |
| 2.2019 | row2 |
| 11.2018 | row3 |
| 8.2021 | row4 |
| 6.2021 | row5 |
The Date column is interpreted as a float dtype, but in reality 1.2019 means month 1 (that is, January) of the year 2019. I changed it to string type and it worked well, or at least it seems so. I want to plot this data against the total count of something, which is column 2 of the dataset, but when I plot it:
the x-axis is not ordered. Well, why would it be? There is no ordering relationship between the strings 1.2019 and 2.2019: there is no way to know that the first is January 2019 and the second is February. I thought of using regex, or even mapping 1.2019 to jan-2019, but the problem persists: these are strings with no date ordering. I know there is the datetime method, but I don't know if it would help me.
How can I proceed? It is probably very easy, but I am stuck here!
Convert to datetime with pandas.to_datetime:
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%m.%Y')
or if you have a pandas version that refuses to convert if the day is missing:
pd.to_datetime('1.'+df['Date'].astype(str), format='%d.%m.%Y')
output:
Date Another column
0 2019-01-01 row1
1 2019-02-01 row2
2 2018-11-01 row3
3 2021-08-01 row4
4 2021-06-01 row5
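One caveat if the column really is stored as floats: a month-year such as 10.2020 is the same float as 10.202, so converting it with str loses the trailing zero. The sketch below uses only the Python standard library to show the effect and one possible workaround, formatting each value with exactly four decimals before parsing. The value 10.2020 is added purely for illustration and is not in the question's data.

```python
from datetime import datetime

# Values from the question, plus 10.2020 as a hypothetical extra case.
raw = [1.2019, 2.2019, 11.2018, 8.2021, 6.2021, 10.2020]

# str(10.2020) yields "10.202", which no longer looks like month.year,
# so format every float with exactly four decimals instead.
labels = [f"{value:.4f}" for value in raw]

# Once parsed as real dates, sorting is chronological rather than lexical.
months = sorted(datetime.strptime(label, "%m.%Y") for label in labels)
ordered = [month.strftime("%Y-%m") for month in months]
```

In pandas, the same four-decimal formatting could be applied to the column before calling pd.to_datetime.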
Given a table that has a column of time ranges e.g.:
| <2015-10-02>--<2015-10-24> |
| <2015-10-05>--<2015-10-20> |
....
how can I create a column showing the results of org-evaluate-time-range?
If I attempt something like:
#+TBLFM: $2='(org-evaluate-time-range $1)
the 2nd column is populated with
Time difference inserted
in every row.
It would also be nice to generate the same result from two different columns with, say, start date and end date instead of creating one column of time ranges out of those two.
If you have your date range split into 2 columns, a simple subtraction works and returns number of days:
| <2015-10-05> | <2015-10-20> | 15 |
| <2013-10-02 08:30> | <2015-10-24> | 751.64583 |
#+TBLFM: $3=$2-$1
Using org-evaluate-time-range is also possible, and you get a nice formatted output:
| <2015-10-02>--<2015-10-24> | 22 days |
| <2015-10-05>--<2015-10-20> | 15 days |
| <2015-10-22 Thu 21:08>--<2015-08-01> | 82 days 21 hours 8 minutes |
#+TBLFM: $2='(org-evaluate-time-range)
Note that the only optional argument that org-evaluate-time-range accepts is a flag to indicate insertion of the result in the current buffer, which you don't want.
How this function (called without arguments) finds the correct time range when evaluated is a complete mystery to me; pure magic(!)
I have a dataset with two variables: journalistName, articleDate
For each journalist (group), I want to create a variable that chronologically categorizes the articles into 1 for "first half" and 2 for "second half".
For example, if a journalist wrote 4 articles, I want first two articles categorized as 1.
If he wrote 5 articles, I want first three articles categorized as 1.
One possibility I thought of is to calculate the midpoint date and then use an if condition (gen cat1 = 1 if midpoint > startdate), but I don't know how to generate such a midpoint in Stata.
Per your description of which articles to categorize as 1, you're looking for the midpoint of the number of articles rather than the midpoint of the date range.
One solution is to use by-group processing, _n, and _N:
gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)
This sorts by author and date, and then assigns cat = 1 to observations within each group of author where the current observation (_n) number is less than or equal to the median observation (ceil(_N/2)).
Note that you need a numeric (rather than string) date for the sort to work properly. Also, in my opinion, cat = {1,2} is less intuitive than something like firsthalf = {0,1}. Either way, labeling the values (help label) would aid clarity.
For more information, see help by and this article.
Finally, the method in action:
clear all
input str10 author str10 datestr
"Alex" "09may2015"
"Alex" "06apr2015"
"Alex" "15jul2014"
"Alex" "19aug2013"
"Alex" "03mar2009"
"Betty" "09may2015"
"Betty" "06apr2015"
"Betty" "15jul2014"
"Betty" "19aug2013"
end
gen date = daily(datestr, "DMY")
format date %td
gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)
list , sepby(author) noobs
and the result
+--------------------------------------+
| author datestr date cat |
|--------------------------------------|
| Alex 03mar2009 03mar2009 1 |
| Alex 19aug2013 19aug2013 1 |
| Alex 15jul2014 15jul2014 1 |
| Alex 06apr2015 06apr2015 2 |
| Alex 09may2015 09may2015 2 |
|--------------------------------------|
| Betty 19aug2013 19aug2013 1 |
| Betty 15jul2014 15jul2014 1 |
| Betty 06apr2015 06apr2015 2 |
| Betty 09may2015 09may2015 2 |
+--------------------------------------+
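To make the ceil(_N/2) rule concrete outside Stata, here is a small Python sketch that reproduces the split for the example data. It is an illustration only, not part of the Stata solution: the function name first_half_flags is hypothetical, and it mirrors bysort author (date) by grouping on author and sorting each group's dates.

```python
from collections import defaultdict
from datetime import date
from math import ceil

# Same observations as the Stata example.
articles = [
    ("Alex", date(2015, 5, 9)), ("Alex", date(2015, 4, 6)),
    ("Alex", date(2014, 7, 15)), ("Alex", date(2013, 8, 19)),
    ("Alex", date(2009, 3, 3)),
    ("Betty", date(2015, 5, 9)), ("Betty", date(2015, 4, 6)),
    ("Betty", date(2014, 7, 15)), ("Betty", date(2013, 8, 19)),
]


def first_half_flags(rows):
    """Return (author, date, cat) with cat = 1 for the first ceil(N/2) articles."""
    by_author = defaultdict(list)
    for author, when in rows:
        by_author[author].append(when)
    flagged = []
    for author in sorted(by_author):
        dates = sorted(by_author[author])  # chronological within author
        cutoff = ceil(len(dates) / 2)      # counterpart of ceil(_N/2)
        for position, when in enumerate(dates, start=1):  # position plays the role of _n
            flagged.append((author, when, 1 if position <= cutoff else 2))
    return flagged


flags = first_half_flags(articles)
```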
If you are indeed seeking to calculate the midpoint date, you can do so using the same general principle:
bysort author (date): gen beforemiddate = date <= ceil((date[_N] + date[1]) / 2)
Also, to find the last date in the "pre-midpoint" period, you can use the same principles:
bysort author cat (date): gen lastdate = date[_N] if cat == 1
by author: replace lastdate = lastdate[_n-1] if missing(lastdate)
format lastdate %td
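or an egen function with a logical test included gets the job done a bit faster (this relies on the numeric dates being positive, that is, later than 01jan1960, because observations with cat == 2 contribute 0 to the maximum):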
egen lastdate = max(date * (cat == 1)) , by(author)
format lastdate %td