What am I doing wrong while importing the following data into SAS?

I am trying to import certain data into my SAS dataset using this piece of code:
Data Names_And_More;
Infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt';
Input Name & $20.
Phone : $20.
Height & $10.
Mixed & $10.;
run;
The data in the file is as below:
Roger Cody (908)782-1234 5ft. 10in. 50 1/8
Thomas Jefferson (315)848-8484 6ft. 1in. 23 1/2
Marco Polo (800)123-4567 5Ft. 6in. 40
Brian Watson (518)355-1766 5ft. 10in 89 3/4
Michael DeMarco (445)232-2233 6ft. 76 1/3
I have been trying to learn SAS, and while going through Ron Cody's book Learning SAS by Example, I found that to import the kind of data above we can use 'the ampersand (&) informat modifier. The ampersand, like the colon, says to use the supplied informat, but the delimiter is now two or more blanks instead of just one.' (Ron's words, not mine). However, when I import the file, the resulting dataset is as follows:
Name                  Phone   Height     Mixed
Roger Cody (908)782-  Thomas  Jefferson  Marco Polo
Also, for further details the SAS log is as follows:
419 Data Names_And_More;
420 Infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt';
421 Input Name & $20.
422 Phone : $20.
423 Height & $10.
424 Mixed & $10.
425 ;run;
NOTE:
The infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt' is:
File Name=C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt,
RECFM=V,LRECL=256
NOTE: LOST CARD.
Name=Brian Watson (518)35 Phone=Michael Height=DeMarco (4 Mixed= _ERROR_=1 _N_=2
NOTE: 5 records were read from the infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3
Portable\Names_and_More.txt'.
The minimum record length was 37.
The maximum record length was 47.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.NAMES_AND_MORE has 1 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.14 seconds
I am looking for some help with this one. It'd be great if someone could explain what exactly is happening, what I am doing wrong, and how to correct this error.
Thanks

The answer is in the explanation from Ron Cody's book: & means you need two spaces to separate variables, so you need a second space after the name (and after any other field read with &).
Wrong:
Roger Cody (908)782-1234 5ft. 10in. 50 1/8
Right:
Roger Cody  (908)782-1234 5ft. 10in.  50 1/8
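The delimiter rule can also be imitated outside SAS to see why the spacing matters. The Python sketch below is only an illustration of the rule, not how SAS itself is implemented: the function name split_record is made up, and it assumes the corrected line has two blanks after the name and after the height.

```python
import re

def split_record(line):
    """Roughly imitate: Input Name & Phone : Height & Mixed &."""
    # Name uses &: the field ends at a run of two or more blanks.
    name, rest = re.split(r" {2,}", line.strip(), maxsplit=1)
    # Phone uses : so a single blank ends the field.
    phone, rest = rest.split(" ", 1)
    # Height uses & again; Mixed is last and takes whatever remains.
    height, mixed = re.split(r" {2,}", rest.strip(), maxsplit=1)
    return name, phone, height, mixed

fixed = "Roger Cody  (908)782-1234 5ft. 10in.  50 1/8"
print(split_record(fixed))
```

With a line that contains only single blanks, the first split finds no two-blank delimiter at all, which is the situation in which SAS keeps reading to the end of the line and then moves on to the next record.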

How to import dates correctly from this .csv file into Matlab?

I have a .csv file with the first column containing dates, a snippet of which looks like the following:
date,values
03/11/2020,1
03/12/2020,2
3/14/20,3
3/15/20,4
3/16/20,5
04/01/2020,6
I would like to import this data into Matlab (I think the best way would probably be the readtable() function). My goal is to bring the dates into Matlab as a datetime array. As you can see above, the problem is that the dates in the original .csv file are not consistently formatted: some are mm/dd/yyyy and some are mm/dd/yy.
Simply calling data = readtable('myfile.csv') on the .csv file results in the following, which is not correct:
'03/11/2020' 1
'03/12/2020' 2
'03/14/0020' 3
'03/15/0020' 4
'03/16/0020' 5
'04/01/2020' 6
Does anyone know a way to automatically account for this type of data in the import?
Thank you!
My version: Matlab R2017a
EDIT ---------------------------------------
Following the suggestion of Max, I have tried specifying some of the input options for the read command using the following:
T = readtable('example.csv',...
'Format','%{dd/MM/yyyy}D %d',...
'Delimiter', ',',...
'HeaderLines', 0,...
'ReadVariableNames', true)
which results in:
date values
__________ ______
03/11/2020 1
03/12/2020 2
NaT 3
NaT 4
NaT 5
04/01/2020 6
and you can see that this is not working either.
If you are sure that none of the dates involved go back more than 100 years, you can apply the pivot method that was in use in the last century (before the Y2K bug warned the world of its danger).
They used to code dates in 2 digits only, knowing that 87 actually meant 1987. A user (or a computer) would add the missing years automatically.
In your case, you can read the full table, parse the dates, then it is easy to detect which dates are inconsistent. Identify them, correct them, and you are good to go.
With your example:
a = readtable(tfile) ; % read the file
dates = datetime(a.date) ; % extract first column and convert to [datetime]
idx2change = dates.Year < 2000 ; % Find which dates were in short format
dates.Year(idx2change) = dates.Year(idx2change) + 2000 ; % Correct truncated years
a.date = dates % reinject corrected [datetime] array into the table
yields:
a =
date values
___________ ______
11-Mar-2020 1
12-Mar-2020 2
14-Mar-2020 3
15-Mar-2020 4
16-Mar-2020 5
01-Apr-2020 6
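The same pivot idea, written out in Python purely as an illustration of the logic (it is not a replacement for the Matlab code above). The function name parse_mdy and the pivot_century parameter are made up for this sketch, and it assumes every date is month/day/year with either a two-digit or a four-digit year.

```python
from datetime import date

def parse_mdy(text, pivot_century=2000):
    """Parse m/d/yy or mm/dd/yyyy, expanding two-digit years with a pivot."""
    month, day, year = (int(part) for part in text.split("/"))
    if year < 100:
        # A truncated year such as 20 is assumed to mean 2020.
        year += pivot_century
    return date(year, month, day)

for raw in ["03/11/2020", "3/14/20", "04/01/2020"]:
    print(raw, "->", parse_mdy(raw).isoformat())
```

The pivot is only safe when the real range of years is known in advance, which is the same condition stated at the start of this answer.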
Instead of specifying the format explicitly (as I also suggested before), one should use import options; for a CSV file, that means delimitedTextImportOptions:
opts = delimitedTextImportOptions('NumVariables',2,...% how many variables per row?
'VariableNamesLine',1,... % is there a header? If yes, in which line are the variable names?
'DataLines',2,... % in which line does the actual data start?
'VariableTypes',{'datetime','double'})% as what data types should the variables be read
readtable('myfile.csv',opts)
This works because the import recognizes the format of the dates automatically once it knows that the column must be a datetime object =)

SAS: taking data in dd.m.yyyy format from a csv

I need to import data from a csv file. I'm able to read everything except the date, which is in dd.m.yyyy format, for example: 6;Tiku;17.1.1967;M;191;
I'm guessing I need to specify an informat to read it in? I can't figure out which one to use, because nothing I've tried works.
What I've managed so far:
data [insert name here];
infile [insert name here];
dlm=";" missover;
length Avain 8 Nimi $10 Syntymapaiva 8 Sukupuoli $1 Pituus 8 Paino 5;
input
Avain Nimi $ Syntymapaiva ddmmyyp.(=this doesnt work) Sukupuoli$ Pituus
Paino;
format Paino COMMA5.2 ;
label Syntymapaiva="Syntymäpäivä";
run;
And part of the actual file to read in:
6;Tiku;17.1.1967;M;191;
Thank you for helping this doofus out!
There is no informat named DDMMYYP.. Use the informat DDMMYY. instead.
Also make sure to use the : modifier before the informat specification included in the INPUT statement so that you are still using list mode input instead of formatted input. If you use formatted input instead of list mode input then SAS could read past the delimiter.
input Avain Nimi Syntymapaiva :ddmmyy. Sukupuoli Pituus Paino;
Perhaps you are confused because there is a format named DDMMYYP.
Formats are used to convert values to text. Informats are what you need to use when you want to convert text to values.
553 options nofmterr ;
554 data _null_;
555 str='17.1.1967';
556 ddmmyy = input(str,ddmmyy10.);
557 ddmmyyp = input(str,ddmmyyp10.);
----------
485
NOTE 485-185: Informat DDMMYYP was not found or could not be loaded.
558 put str= (dd:) (= yymmdd10.);
559 _error_=0;
560 run;
NOTE: Invalid argument to function INPUT at line 557 column 13.
str=17.1.1967 ddmmyy=1967-01-17 ddmmyyp=.
NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to
missing values.
Each place is given by: (Number of times) at (Line):(Column).
1 at 557:13
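The informat/format distinction is not specific to SAS. As a loose analogy only (the variable names below are made up, and Python's strptime/strftime are not SAS informats or formats), parsing text into a date value and rendering that value back as text are two separate steps:

```python
from datetime import datetime

text = "17.1.1967"

# "Informat" direction: text -> date value (day.month.year).
value = datetime.strptime(text, "%d.%m.%Y").date()

# "Format" direction: date value -> text, here as yyyy-mm-dd.
shown = value.strftime("%Y-%m-%d")

print(value.year, value.month, value.day, shown)
```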
You could use the anydtdte informat, but (as @Tom points out) if your data is known to be fixed in this format, then ddmmyy. would be better. Also, Tom's advice about using the : modifier is correct, and it is preferable in most (if not all) cases.
data want;
infile cards dlm=";" missover;
input Avain Nimi:$10. Syntymapaiva:ddmmyy. Sukupuoli:$1. Pituus Paino;
format Paino COMMA5.2 Syntymapaiva date9.;
label Syntymapaiva="Syntymäpäivä";
datalines4;
6;Tiku;17.1.1967;M;191;
;;;;
run;

How to import .txt with characters and numbers in SAS

I want to import a .txt file in SAS.
Here is what my data looks like:
annee manufacturier modele categorie cylindree cylindres transmission ville ...
2016 Ford Focus 1 1.8 5 Manual 10.1
2016 Toyota Tercel 3 1.4 3 Auto 7.1
Here is my code
data car;
infile "C:\Users\Mark\Desktop\sas\car.txt"
LRECL=10000000 DLM=" " firstobs=2 ;
input
annee manufacturier modele categorie cylindree cylindres transmission type ville route combine emissiond indice
;
run;
But when I run it, I get a lot of "Invalid data for ..." messages, and I end up with very little data in my SAS table and lots of missing values.
Some variables are numbers and some are characters. I feel like the problem is there.
How could I import this type of file?
Thank you
A text file does not have any intrinsic data type. Everything is character, until you explicitly tell SAS the data type of your columns. Also, sometimes you need to tell SAS the input format, or informat, of your data.
Sometimes SAS is smart enough to guess your data informat correctly. For example, the code below generates the same results if you delete the informat statement. This would not be the case for, say, dates, so explicitly specifying the informat is generally best practice.
If your data was delimited, such as CSV, you could use PROC IMPORT to import your data. Using PROC IMPORT, SAS will make a best guess as to the data type based on the content of the columns (like Excel does when it imports text data).
The below code will import the data you specified:
filename temp temp;
data _null_;
infile datalines;
file temp;
input;
put _infile_;
datalines;
annee manufacturier modele categorie cylindree cylindres transmission ville
2016 Ford Focus 1 1.8 5 Manual 10.1
2016 Toyota Tercel 3 1.4 3 Auto 7.1
run;
data want;
infile temp firstobs=2;
length
annee 8
manufacturier $20
modele $20
categorie 8
cylindree 8
cylindres 8
transmission $20
ville 8
;
informat
cylindree 8.1
ville 8.1
;
input
annee
manufacturier
modele
categorie
cylindree
cylindres
transmission
ville
;
run;
If your data values contained spaces, for example manufacturier = Mercedes Benz, then you would need to use an informat (e.g. $char20.) for that column as well.
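The point that a text file carries no types until you declare them can be shown in a few lines of Python. This is a sketch for illustration only: the COLUMNS list and parse_row are made-up names, the types simply mirror the length statement above, and it assumes no value contains a space.

```python
# Each column gets an explicit converter, like declaring lengths/informats.
COLUMNS = [
    ("annee", int),
    ("manufacturier", str),
    ("modele", str),
    ("categorie", int),
    ("cylindree", float),
    ("cylindres", int),
    ("transmission", str),
    ("ville", float),
]

def parse_row(line):
    """Split on whitespace and convert each token with its column's type."""
    tokens = line.split()
    return {name: convert(tok) for (name, convert), tok in zip(COLUMNS, tokens)}

row = parse_row("2016 Ford Focus 1 1.8 5 Manual 10.1")
print(row)
```

Converting "Ford" with a numeric converter would fail, which is the Python counterpart of the "Invalid data" notes in the SAS log when a character column is read as numeric.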

Reading fixed width file - spaces are recognized

I am trying to read this fixed width data into SAS:
John Garcia 114 Maple Ave.
Sylvia Chung 1302 Washington Drive
Martha Newton 45 S.E. 14th St.
I used this code:
libname mysas 'c:\users\LELopez243\mysas';
filename address 'c:\users\LELopez243\mysas\address.dat';
data mysas.address2;
infile address;
input Name $ 1-15 Number 16-19 Street $ 22-37;
run;
proc print data=mysas.address2;
run;
Got this result:
Obs Name Number Street
1 John Garcia 114 Sylvia Chung 1
2 Martha Newton 45
If I edit the .dat file and manually add spaces at the end of each line until they are all the same length, the code works. Any ideas for code that takes differing line lengths into account (without manually entering spaces)?
Add the truncover option to your infile statement.
Truncover overrides the default behavior of the INPUT statement when an input data record is shorter than the INPUT statement expects. By default, the INPUT statement automatically reads the next input data record. TRUNCOVER enables you to read variable-length records when some records are shorter than the INPUT statement expects. Variables without any values assigned are set to missing.
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146932.htm
libname mysas 'c:\users\LELopez243\mysas';
filename address 'c:\users\LELopez243\mysas\address.dat';
data mysas.address2;
infile address truncover;
input Name $ 1-15 Number 16-19 Street $ 22-37;
run;
proc print data=mysas.address2;
run;
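For comparison, here is what "read the columns that exist and treat the rest as missing" looks like in Python, where slicing past the end of a short line simply returns less text. This is a sketch of the idea and not of SAS internals; parse_address is a made-up name, and the sample record is rebuilt from the column positions in the INPUT statement (1-15, 16-19, 22-37), which is an assumption about the original file layout.

```python
def parse_address(line):
    """Column-based read that tolerates lines shorter than column 37."""
    name = line[0:15].strip()          # columns 1-15
    number_text = line[15:19].strip()  # columns 16-19
    street = line[21:37].strip()       # columns 22-37, may be cut short
    number = int(number_text) if number_text else None
    return name, number, street

# A record that stops at column 31, well before column 37.
record = "John Garcia".ljust(15) + "114".ljust(4) + "  " + "Maple Ave."
print(len(record), parse_address(record))
```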

Using sed to copy data between two numerical patterns to a new file

I'm running a bunch (~320) computational chemistry experiments and I need to pull a small amount of the data out of each of the files so that I can do some work on it in MatLab.
I'm pretty sure I can use sed to make this work, but try as I might I don't seem to be able to do so.
I need all of the data starting at the line beginning with "1 1" and ending with the line starting with "33 33".
I J FI(I,J) k(I,J) K(I,J)
1 1 -337.13279 -0.06697 -0.00430
2 2 3804.89120 8.52972 0.54787
3 3 3195.69653 6.01702 0.38648
4 4 3189.18684 5.99253 0.38490
5 5 3183.73262 5.97205 0.38359
6 6 3174.47525 5.93737 0.38136
7 7 3167.88746 5.91275 0.37978
8 8 1628.80868 1.56311 0.10040
9 9 1623.56055 1.55306 0.09975
10 10 1518.21620 1.35806 0.08723
11 11 1476.93012 1.28520 0.08255
12 12 1341.24087 1.05990 0.06808
13 13 1312.30373 1.01466 0.06517
14 14 1264.73004 0.94242 0.06053
15 15 1185.62592 0.82822 0.05320
16 16 1175.54013 0.81419 0.05230
17 17 1170.41211 0.80710 0.05184
18 18 1090.20196 0.70027 0.04498
19 19 1039.29190 0.63639 0.04088
20 20 1015.00116 0.60699 0.03899
21 21 1005.05773 0.59516 0.03823
22 22 986.55965 0.57345 0.03683
23 23 917.65537 0.49615 0.03187
24 24 842.93089 0.41863 0.02689
25 25 819.00146 0.39520 0.02538
26 26 758.39720 0.33888 0.02177
27 27 697.11173 0.28632 0.01839
28 28 628.75684 0.23292 0.01496
29 29 534.75856 0.16849 0.01082
30 30 499.35579 0.14692 0.00944
31 31 422.01320 0.10493 0.00674
32 32 409.30255 0.09870 0.00634
33 33 227.12411 0.03039 0.00195
33 2nd derivatives larger than 0.371D-04 over 561
MatLab is not a fan of text, so I'd like to not use text delimiters (though there are some in the header of this data section) and keep the data contained to only the numeric lines.
The data files contain a lot of other numbers as well, so I need to match the occurrence of "1 1" at the start of the line and "33 33" as the end of the copy. These 'indices' exist only in this block of info.
I attempted to use
% sed -n /"1 1"/,/"33 33"/p input.file > output.file
But I get a WHOLE BUNCH of data in the output file as it copies everything that shows up between any "1" and "33"
Is there any way to do what I'm looking for?
Also, I'm using the tcsh as that is what my servers run.
How about using awk:
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{t=0}' file
As recommended by @mklement0, if there is only one block, you can avoid processing the remainder of the file by changing the command to:
awk '$1=="1"&&$2=="1"{t=1};t;$1=="33"&&$2=="33"{exit}' file
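The flag logic in the awk one-liner can be spelled out step by step. The Python sketch below is only an illustration of that logic (extract_block and its parameters are made-up names); like awk, it compares whole whitespace-separated fields, so a line starting with "11 11" or "33 2nd" does not match.

```python
def extract_block(lines, first=("1", "1"), last=("33", "33")):
    """Keep lines from the one whose first two fields equal `first`
    through the one whose first two fields equal `last`."""
    keep = False
    kept = []
    for line in lines:
        fields = tuple(line.split()[:2])
        if fields == first:
            keep = True          # start of the block
        if keep:
            kept.append(line)
            if fields == last:
                break            # end of the block: stop reading
    return kept

sample = [
    "I J FI(I,J) k(I,J) K(I,J)",
    "1 1 -337.13279 -0.06697 -0.00430",
    "2 2 3804.89120 8.52972 0.54787",
    "33 33 227.12411 0.03039 0.00195",
    "33 2nd derivatives larger than 0.371D-04 over 561",
]
print("\n".join(extract_block(sample)))
```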
Your problem is twofold. First, there are two blanks between the ones, but your regex only allows for one (judging from the now indented code). Second, you are probably not precise enough; the /1 1/ pattern matches 11 11, for example, and 111 111 and so on.
So, you should consider:
sed -n -e '/^ *1  *1 /,/^33  *33 /p' -e '/^33  *33 /q' input.file > output.file
The patterns are anchored to the start of line by the ^ (caret). The numbers are separated by one or more blanks (there are other, longer-winded ways of writing that in standard sed; the + option is not standard sed but is widely available). And the numbers are terminated by a blank. The chances are that the first expression alone will give you what you want. The second expression terminates the search early when it recognizes the 33 33 input line, which can save a significant amount of file I/O and hence processing time if the input file is big enough.
If the lines with ID numbers in the hundreds have some different format, then it should be fairly straight-forward to tweak the regexes to match what is used. If the data contains tabs instead of (or as well as) blanks, you can tweak the regexes to manage that, too.
If your data is all formatted exactly the same as this file, then you can use sed to just read the 3rd through the 35th line (rows 1 1 - 33 33). This is a lot easier than parsing the values, but does require that the files have a standard format:
sed -n 3,35p data.txt
Another cheap way would be to grep for only numeric lines, and take only the first 33:
grep "^[0-9 ][0-9 .-]*$" data.txt | head -n 33