I have an Excel sheet as input for Stata. In the Excel, a dot in a cell marks a missing value, e.g.:
Column1  Column2
1        10
2        .
.        13
.        15
3        .
However, when importing the Excel file into Stata, both columns above are identified as strings.
How can I tell Stata during the import that all dots should be recognized as missing values, so that my numeric columns remain numeric even though they include some dots/missing values?
I presume you are importing from either Excel or a CSV.
Excel
From the import excel guidance:
If the column contains at least one cell with nonnumerical text, the entire column is imported as a string variable.
So the easiest solution is:
destring the variables. You can destring a whole list in one go via:
destring var_1 var_2 var_3, replace
That will overwrite the variables as numeric variables, and the . entries will be coded as missing.
Importing a CSV
As with Excel, if there are non-numeric characters Stata will import the column as a string. You could use the numericcols option when importing (the filename and column numbers below are placeholders):
import delimited using "myfile.csv", numericcols(1 2)
The columns you list in the numericcols option are forced to be numeric, and the . entries are interpreted as missing.
Equally easy would be simply to destring, as outlined above.
Related
I have some data in a text file in the following format:
1079,40,011,1,301 17,310 4,668 6,680 1,682 1,400 7,590 2,591 139,592 332,565 23,568 2,569 2,595 1,471 1,470 10,481 12,540 117,510 1,522 187,492 9,533 41,558 15,555 12,556 9,558 27,546 1,446 1,523 4000,534 2000,364 1,999/
1083,40,021,1,301 4,310 2,680 1,442 1,400 2,590 2,591 90,592 139,595 11,565 6,470 2,540 66,522 4,492 1,533 19,546 3,505 1,523 3000,534 500,999/
These examples represent what would be two rows in a spreadsheet. The first four values (in the first example, "1079,40,011,1") each go into their own column. The rest of the data are in a paired format, first listing a name of a column, designated by a number, then a space followed by the value that should appear in that column. So again, example: 301 17,310 4,668 6: in this row, column 301 has a value of 17, column 310 has value of 4, column 668 has value of 6, etc. Then 999/ indicates an end to that row.
Any suggestions on how I can transform this text file format into a usable spreadsheet would be greatly appreciated. There are thousands of "rows", so I can't just manually convert them, and I don't possess the coding skills to execute such a transformation myself.
This is messy but since there is a pattern it should be doable. What software are you using?
My first idea would be to identify when the delimiter changes from comma to space. Is it based on a fixed width, like always after 14 characters? Or is it based on the delimiter, like always after the 4th comma?
Once you've done that, you could make two passes at the data. The first pass imports the first four values from the beginning of the line which are separated by comma. The second pass imports the remaining values which are separated by space.
If you include a row number when importing you can then use it to join first and second passes at importing.
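If a scripting tool is available, the format can also be parsed in a single pass; here is a minimal Python sketch (the names f1–f4 for the first four columns are placeholders, since the real column names are not given):

```python
def parse_line(line):
    """Parse one record: four fixed leading values separated by commas,
    then "column value" pairs, terminated by the 999/ marker."""
    fields = line.strip().rstrip("/").split(",")
    # The first four values each go into their own column (names assumed).
    record = {"f1": fields[0], "f2": fields[1], "f3": fields[2], "f4": fields[3]}
    # The rest are "column value" pairs separated by a space.
    for pair in fields[4:]:
        if pair == "999":        # end-of-row marker
            break
        column, value = pair.split()
        record[column] = value
    return record
```

Collecting parse_line(row) for every line and writing the resulting dictionaries out with csv.DictWriter (after computing the union of all column keys) would then give a spreadsheet-ready CSV.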
How do I keep variables in separate columns when using proc import with a tab-delimited txt file? Only one variable is created, called Name__Gender___Age. Is it only possible with the data step?
This is the code:
proc import datafile= '/folders/myfolders/practice data/IMPORT DATA/class.txt'
out=new
dbms=tab
replace;
delimiter='09'x;
run;
You told PROC IMPORT that your text file had tabs between the fields. From the name of the variable it created it is most likely that instead your file just has spaces between the fields. And multiple spaces so that the lines look neatly aligned when viewed with a fixed width font.
Just write your own data step to read the file (something you should do anyway for text files).
data new;
infile '/folders/myfolders/practice data/IMPORT DATA/class.txt' firstobs=2 truncover;
length Name $30 Gender $6 Age 8 ;
input name gender age;
run;
If there are missing values for either NAME or GENDER that are not entered as a period, then you will probably want to read the file using formatted or column-mode input instead of the simple list-mode input shown above.
The data file appears to have space delimiters instead of tab, contrary to your expectations.
Because you specified tab delimiting, the spaces in the header row are considered part of a single column named Name Gender Age. Because spaces are not allowed in SAS column names (with default settings), the spaces were converted to underscores. That is why you ended up with Name__Gender___Age.
Change the delimiter to space and you should be able to import.
If the data file has a mix of space and tab delimiting, you will want to edit the data file to be consistent.
My MATLAB code for data import is giving me different results for what appear to be similar text files as input. Input1 gives me a normal cell array with all lines from the text file as entries, which I can reference using {i}.
Input2 gives me a scalar struct where all numeric entries in my text file are converted into the input.data field. I want all files to be converted to regular cell entries, and I do not understand why some files are converted to scalar structs instead.
Code: input = importdata(strcat(direct,'\',filename));
Input1 example: correctly working data import, with the text file on the right.
File link: https://drive.google.com/open?id=1aHK4xivqEqJEmaA8Dv8Y0uW5giG-Bbip
Input2 example: incorrectly working data import, with the text file on the right. File link: https://drive.google.com/open?id=1nzUj_wR1bNXFcGaSLGva6uVsxrk-R5vA
UTSL!
I'm guessing you are using GNU Octave, although you wrote "Matlab" as the topic of your question.
In importdata.m around line 178, the code tries to automatically detect the delimiter for your data:
delim = regexpi (row, '[-+\d.e*ij ]+([^-+\de.ij])[-+\de*.ij ]','tokens', 'once');
If you run this against W40A0060; you get A as the delimiter, because there is in effect a number before and after it.
If you run this against W39E0016; you get {} (empty) as the delimiter, because the E could be part of a number in scientific notation and is therefore excluded.
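The effect can be reproduced outside Octave; the following is a rough Python translation of that regex for illustration (an approximation, not the importdata source itself):

```python
import re

# Approximate translation of the Octave delimiter-detection regex:
# a run of number-like characters, one candidate delimiter character
# (anything that could not be part of a number), then more
# number-like characters.
pattern = re.compile(r'[-+\d.e*ij ]+([^-+\de.ij])[-+\de*.ij ]', re.IGNORECASE)

def detect_delimiter(row):
    """Return the first detected delimiter character, or None."""
    match = pattern.search(row)
    return match.group(1) if match else None
```

Against "W40A0060;" this finds "A" as the delimiter; against "W39E0016;" nothing matches, because the E is excluded as potentially part of a number in scientific notation.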
Solution:
You really should pass the correct delimiter to the importdata call rather than trusting that it is magically detected.
And if you just want the lines in a cell, use
strsplit (fileread ("W39E0016_Input2.txt"), "\n")
Analysis
This looks indeed strange!
EDIT: The cause of this strange-looking behaviour has been deciphered by @Andy (see his solution).
When you use all outputs of the importdata() function, you can see what happens when reading the data:
[dat1,del1,headerrows1]=importdata('Input1.txt')
[dat2,del2,headerrows2]=importdata('Input2.txt')
For your first file it recognizes 69 header rows and no delimiter:
del1 = []
headerrows1 = 69
while in your second file only two header rows and a comma (,) delimiter are recognized:
del2 = ','
headerrows2 = 2
I cannot find an obvious reason in your files for this different interpretation of the data.
Suggestion
Your data format is rather complex. It is not a simple table like one produced by Excel: it has multiple lines with a different number of fields per line and varying data types. importdata() is not designed for this kind of data. I suggest writing a specific import function for this kind of file. Have a look at textread() as a first guess. You can use it to read the lines of the files as text and later interpret them with sscanf(), or use strsplit() to split the line contents into fields.
Sample data and required excel image:
Also, read the Time section as shown in the file, and populate the Excel file with that data in a column with the header name Time, as shown above. Likewise, read the message values as shown in the .asc file and populate the Excel file, converting the numbers from hexadecimal to decimal, in columns named Data1, Data2, Data3, …
If your '.asc' file consists of tab delimited ASCII text then Excel will allow you to import it into an Excel worksheet.
The following explainer comes from Microsoft's Office support site:
There are two ways to import data from a text file by using Microsoft Excel: you can open the text file in Excel, or you can import the text file as an external data range. To export data from Excel to a text file, use the Save As command.
There are two commonly used text file formats:
Delimited text files (.txt), in which the TAB character (ASCII character code 009) typically separates each field of text.
Comma separated values text files (.csv), in which the comma character (,) typically separates each field of text.
You can change the separator character that is used in both delimited and .csv text files. This may be necessary to make sure that the import or export operation works the way that you want it to.
If neither of those methods works for you and your '.asc' file was generated by MATLAB, then you may be able to use MATLAB to export directly to an Excel worksheet: MATLAB's xlswrite function writes directly to a Microsoft Excel spreadsheet.
Another option, if you're comfortable writing some parsing code of your own, is to use MATLAB's textscan function to parse your '.asc' file.
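Whichever route you take, the hexadecimal-to-decimal step itself is simple. As an illustration (in Python rather than MATLAB, and assuming the data bytes arrive as space-separated hex strings, since the sample file is not shown):

```python
def hex_bytes_to_decimal(message):
    """Convert a space-separated string of hexadecimal byte values,
    e.g. "1A 2B 0F", into a list of decimal integers."""
    return [int(byte, 16) for byte in message.split()]
```

Each converted list would then fill the Data1, Data2, Data3, … columns of one row.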
I want to import a delimited text file into Stata. Some of the fields are numeric, with the numbers formatted with commas (e.g. 2,144.20). When I specify a numeric data type in the infix command for these columns, the values are imported as missing.
infix 2 first str id 2-15 double amount 16-25 using "{datasetname}"
Is there a way to specify the numeric format (e.g. %20.2fc) so that Stata does not treat them as non-numeric? Another way is to import them as strings and convert to numeric later, but I want to know whether the format can be specified in the infix command itself.
There is no such syntax. It would not even make sense from a Stata point of view as a format such as %20.2fc is a display format and controls what is shown (output), not what is read in (input).
Use destring, ignore(",") replace to fix such variables after reading them in.
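For comparison, the same fix in any general-purpose language amounts to stripping the separator before numeric conversion; a Python illustration (not Stata):

```python
def to_number(text):
    """Convert a string such as "2,144.20" to a float, treating a
    bare "." as a missing value (None), as Stata does."""
    text = text.strip()
    if text == ".":
        return None
    # Remove the thousands separator, then convert.
    return float(text.replace(",", ""))
```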