Reading fixed width file - spaces are recognized - import

I am trying to read this fixed width data into SAS:
John Garcia 114 Maple Ave.
Sylvia Chung 1302 Washington Drive
Martha Newton 45 S.E. 14th St.
I used this code:
libname mysas 'c:\users\LELopez243\mysas';
filename address 'c:\users\LELopez243\mysas\address.dat';
data mysas.address2;
infile address;
input Name $ 1-15 Number 16-19 Street $ 22-37;
run;
proc print data=mysas.address2;
run;
Got this result:
Obs Name Number Street
1 John Garcia 114 Sylvia Chung 1
2 Martha Newton 45
If I edit the .dat file and manually pad each line with spaces until they are all the same length, the code works. Is there a way to handle differing line lengths in code, without manually adding spaces?

Add the truncover option to your infile statement.
Truncover overrides the default behavior of the INPUT statement when an input data record is shorter than the INPUT statement expects. By default, the INPUT statement automatically reads the next input data record. TRUNCOVER enables you to read variable-length records when some records are shorter than the INPUT statement expects. Variables without any values assigned are set to missing.
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000146932.htm
libname mysas 'c:\users\LELopez243\mysas';
filename address 'c:\users\LELopez243\mysas\address.dat';
data mysas.address2;
infile address truncover;
input Name $ 1-15 Number 16-19 Street $ 22-37;
run;
proc print data=mysas.address2;
run;
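To see the effect without touching the real file, here is a minimal self-contained sketch. It assumes a scratch fileref named short (a hypothetical name) and writes three records of different lengths at the column positions the INPUT statement expects, then reads them back with MISSOVER and with TRUNCOVER. With MISSOVER, a Street value whose record ends before column 37 should come back missing; with TRUNCOVER the partial value is kept.

```sas
/* Scratch file; "short" is a hypothetical fileref used only for this demo */
filename short temp;

/* Write three records of unequal length, with no trailing padding */
data _null_;
  file short;
  put @1 'John Garcia'   @16 '114'  @22 'Maple Ave.';
  put @1 'Sylvia Chung'  @16 '1302' @22 'Washington Drive';
  put @1 'Martha Newton' @16 '45'   @22 'S.E. 14th St.';
run;

/* MISSOVER: stays on the current record, but a field that runs past
   the end of the record is set to missing */
data with_missover;
  infile short missover;
  input Name $ 1-15 Number 16-19 Street $ 22-37;
run;

/* TRUNCOVER: stays on the current record and keeps whatever part of
   the field is present */
data with_truncover;
  infile short truncover;
  input Name $ 1-15 Number 16-19 Street $ 22-37;
run;

proc print data=with_missover;  run;
proc print data=with_truncover; run;
```

Comparing the two printed data sets shows why TRUNCOVER, rather than MISSOVER, is the usual choice for column input on ragged files.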


Importing txt file in SAS

I tried to import a text file into SAS with the following code:
PROC IMPORT DATAFILE= '/home/u44418748/MSc Biostatistics with SAS/Datasets/school.txt'
OUT= outdata
DBMS=dlm
REPLACE;
delimiter='09'x;
GETNAMES=YES;
RUN;
But the import is unsuccessful, apparently because the text file has a period for missing data.
This is what I got in the SAS log:
NOTE: Invalid data for class_size in line 455 16-17.
455 CHAR 454.34.8.32.17.NA.23.125.12.188 31
ZONE 3330330303303304403323330332333
NUMR 454934989329179E1923E125912E188
sl_no=454 school=34 iq=8 test=32 ses=17 class_size=. meanses=23.125 meaniq=12.188 _ERROR_=1 _N_=454
How can I load this text file into SAS?
Did you create that text file from R? R has a nasty habit of writing the text NA for missing numeric values into text files. If you are the one who created the file, then you might check whether the system you are using has a way to avoid putting NA into the file to begin with. In a delimited file, missing values are normally represented by having nothing in the field, so the delimiters are right next to each other. For SAS you can also use a period to represent a missing value.
I wouldn't bother to use PROC IMPORT to read a delimited file. Just write a data step to read the file. Since it looks like your file has only eight variables and they are all numeric, the code is trivial.
data outdata;
infile '/home/u44418748/MSc Biostatistics with SAS/Datasets/school.txt'
dsd dlm='09'x firstobs=2 truncover
;
input sl_no school iq test ses class_size meanses meaniq ;
run;
One way to deal with the NA text in the input file is to replace it with periods. Since all of the fields are numeric, you can do that easily, because you don't have to worry about replacing real text that just happens to have the letter A after the letter N. Here is a trick that uses the _INFILE_ automatic variable to make the change on the fly while reading the file.
data outdata;
infile '/home/u44418748/MSc Biostatistics with SAS/Datasets/school.txt'
dsd dlm='09'x firstobs=2 truncover
;
input @; /* load the record into _INFILE_ and hold it for the next INPUT */
_infile_=tranwrd(_infile_,'NA','.');
input sl_no school iq test ses class_size meanses meaniq ;
run;
You are getting the NOTE: because of the NA value in the class_size field.
What you presume are periods (.) are actually tabs (hex code 09). Look under the period to confirm, the ZONE is 0 and NUMR 9. 09 is the tab character.
PROC IMPORT guesses each field's data type by looking at the first few rows of a text file (20 rows by default). Your file contained only numbers in the first 20 rows, so the procedure guessed that class_size was numeric.
There are a few courses of action:
Do nothing. Read the NOTEs in your log, and know that wherever NA occurred you will have a missing value in your data set.
Or, read the file as-is, but add a GUESSINGROWS=MAX; statement to your import code. The mixed-type column class_size will then be guessed as character, and you might need another step to convert the values to numeric (a step in which the non-digit values are converted to missing values).
Or, edit the text file, replacing every NA with a period (.). The dot marks a missing value during IMPORT, so the IMPORT step will have no incongruities to log.
Converting a field
PROC IMPORT DATAFILE= '/home/u44418748/MSc Biostatistics with SAS/Datasets/school.txt'
DBMS=dlm REPLACE OUT=work.outdata;
delimiter='09'x;
GETNAMES=YES;
GUESSINGROWS=MAX;
RUN;
data want;
set outdata (rename=(class_size=class_size_char));
class_size = input (class_size_char, ?? best12.);
drop class_size_char;
run;
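The conversion step can be tried in isolation. This is a minimal sketch with made-up values (the data set name demo and the three sample strings are assumptions, not taken from the file). The ?? modifier tells INPUT to return a missing value for text it cannot read as a number, without writing an invalid-data NOTE or setting _ERROR_.

```sas
data demo;
  length class_size_char $8;
  /* sample values are hypothetical; 'NA' stands in for the bad rows */
  do class_size_char = '17', 'NA', '23';
    class_size = input(class_size_char, ?? best12.);
    output;
  end;
run;

proc print data=demo;
run;
```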

multi-character separator in `set datafile separator "|||"` doesn't work

I have an input file example.data with a triple-pipe as separator, dates in the first column, and also some more or less unpredictable text in the last column:
2019-02-01|||123|||345|||567|||Some unpredictable textual data with pipes|,
2019-02-02|||234|||345|||456|||weird symbols # and commas, and so on.
2019-02-03|||345|||234|||123|||text text text
When I try to run the following gnuplot5 script
set terminal png size 400,300
set output 'myplot.png'
set datafile separator "|||"
set xdata time
set timefmt "%Y-%m-%d"
set format x "%y-%m-%d"
plot "example.data" using 1:2 with linespoints
I get the following error:
line 8: warning: Skipping data file with no valid points
plot "example.data" using 1:2 with linespoints
^
"time.gnuplot", line 8: x range is invalid
Even stranger, if I change the last line to
plot "example.data" using 1:4 with linespoints
then it works. It also works for 1:7 and 1:10, but not for other numbers. Why?
When using the
set datafile separator "chars"
syntax, the string is not treated as one long separator. Instead, every character listed between the quotes becomes a separator on its own. From [Janert, 2016]:
If you provide an explicit string, then each character in the string will be
treated as a separator character.
Therefore,
set datafile separator "|||"
is actually equivalent to
set datafile separator "|"
and a line
2019-02-05|||123|||456|||789
is treated as if it had ten columns, of which only the columns 1,4,7,10 are non-empty.
Workaround
Find some other character that is unlikely to appear in the dataset (in the following, I'll assume \t as an example). If you can't dump the dataset with a different separator, use sed to replace ||| by \t:
sed 's/|||/\t/g' example.data > modified.data # in the command line
then proceed with
set datafile separator "\t"
and modified.data as input.
You basically gave the answer yourself.
If you can influence the separator in your data, use a separator which typically does not occur in your data or text. I always thought \t was made for that.
If you cannot influence the separator in your data, use an external tool (awk, Python, Perl, ...) to modify your data. In these languages it is probably a "one-liner". gnuplot has no direct replace function.
If you don't want to install external tools and want to ensure platform independence, there is still a way to do it with gnuplot. Not just a "one-liner", but there is almost nothing you can't also do with gnuplot ;-).
Edit: simplified version with the input from @Ethan (https://stackoverflow.com/a/54541790/7295599).
Assume you have your data in a datablock named $Data. The following code replaces ||| with \t and puts the result into $DataOutput.
### Replace string in dataset
reset session
$Data <<EOD
# data with special string separators
2019-02-01|||123|||345|||567|||Some unpredictable textual data with pipes|,
2019-02-02|||234|||345|||456|||weird symbols # and commas, and so on.
2019-02-03|||345|||234|||123|||text text text
EOD
# replace string function
# prefix RS_ to avoid variable name conflicts
replaceStr(s,s1,s2) = (RS_s='', RS_n=1, (sum[RS_i=1:strlen(s)] \
((s[RS_n:RS_n+strlen(s1)-1] eq s1 ? (RS_s=RS_s.s2, RS_n=RS_n+strlen(s1)) : \
(RS_s=RS_s.s[RS_n:RS_n], RS_n=RS_n+1)), 0)), RS_s)
set print $DataOutput
do for [RS_j=1:|$Data|] {
print replaceStr($Data[RS_j],"|||","\t")
}
set print
print $DataOutput
### end of code
Output:
# data with special string separators
2019-02-01 123 345 567 Some unpredictable textual data with pipes|,
2019-02-02 234 345 456 weird symbols # and commas, and so on.
2019-02-03 345 234 123 text text text

Q: SAS: taking data in dd.m.yyyy format from a csv

I need to import data from a CSV file. I'm able to read everything except the date. The date is in dd.m.yyyy format, for example: 6;Tiku;17.1.1967;M;191;
I'm guessing I need to specify an informat to read it in, but I can't figure out which one to use; nothing I've tried works.
What I've managed so far:
data [insert name here];
infile [insert name here];
dlm=";" missover;
length Avain 8 Nimi $10 Syntymapaiva 8 Sukupuoli $1 Pituus 8 Paino 5;
input
Avain Nimi $ Syntymapaiva ddmmyyp.(=this doesnt work) Sukupuoli$ Pituus
Paino;
format Paino COMMA5.2 ;
label Syntymapaiva="Syntymäpäivä";
run;
And part of the actual file to read in:
6;Tiku;17.1.1967;M;191;
Thank you for helping this doofus out!
There is no informat named DDMMYYP.. Use the informat DDMMYY. instead.
Also make sure to use the : modifier before the informat specification included in the INPUT statement so that you are still using list mode input instead of formatted input. If you use formatted input instead of list mode input then SAS could read past the delimiter.
input Avain Nimi Syntymapaiva :ddmmyy. Sukupuoli Pituus Paino;
Perhaps you are confused because there is a format named DDMMYYP.
Formats are used to convert values to text. Informats are what you need to use when you want to convert text to values.
553 options nofmterr ;
554 data _null_;
555 str='17.1.1967';
556 ddmmyy = input(str,ddmmyy10.);
557 ddmmyyp = input(str,ddmmyyp10.);
----------
485
NOTE 485-185: Informat DDMMYYP was not found or could not be loaded.
558 put str= (dd:) (= yymmdd10.);
559 _error_=0;
560 run;
NOTE: Invalid argument to function INPUT at line 557 column 13.
str=17.1.1967 ddmmyy=1967-01-17 ddmmyyp=.
NOTE: Mathematical operations could not be performed at the following places. The results of the operations have been set to
missing values.
Each place is given by: (Number of times) at (Line):(Column).
1 at 557:13
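The format/informat distinction above can be sketched in one step, assuming nothing beyond the sample string from the question: the DDMMYY informat turns the text into a SAS date value, and the DDMMYYP format turns that value back into text with period separators.

```sas
data _null_;
  str  = '17.1.1967';
  d    = input(str, ddmmyy10.);   /* informat: text -> date value */
  back = put(d, ddmmyyp10.);      /* format: date value -> text   */
  put str= d= back=;
run;
```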
You could use the anydtdte informat, but (as @Tom points out) if your data is known to be fixed in this format, then ddmmyy. would be better. Also, Tom's advice about using the : modifier is correct, and it is preferable in most (if not all) cases.
data want;
infile cards dlm=";" missover;
input Avain Nimi:$10. Syntymapaiva:ddmmyy. Sukupuoli:$1. Pituus Paino;
format Paino COMMA5.2 Syntymapaiva date9.;
label Syntymapaiva="Syntymäpäivä";
datalines4;
6;Tiku;17.1.1967;M;191;
;;;;
run;

How to import .txt with characters and numbers in Sas

I want to import a .txt file in SAS.
Here is what my data looks like:
annee manufacturier modele categorie cylindree cylindres transmission ville ...
2016 Ford Focus 1 1.8 5 Manual 10.1
2016 Toyota Tercel 3 1.4 3 Auto 7.1
Here is my code
data car;
infile "C:\Users\Mark\Desktop\sas\car.txt"
LRECL=10000000 DLM=" " firstobs=2 ;
input
annee manufacturier modele categorie cylindree cylindres transmission type ville route combine emissiond indice
;
run;
But when I run it, I get a lot of "Invalid data for ..." notes, and I end up with very little data in my SAS table and lots of missing values.
Some variables are numbers and some are characters. I suspect the problem is there.
How can I import this type of file?
Thank you
A text file does not have any intrinsic data type. Everything is character, until you explicitly tell SAS the data type of your columns. Also, sometimes you need to tell SAS the input format, or informat, of your data.
Sometimes SAS is smart enough to guess correctly re: your data informat. For example, the below code generates the same results if you delete the informat statement. But, this would not be the case for say dates. In general, explicitly specifying the informat is best practice.
If your data was delimited, such as CSV, you could use PROC IMPORT to import your data. Using PROC IMPORT, SAS will make a best guess as to the data type based on the content of the columns (like Excel does when it imports text data).
The below code will import the data you specified:
filename temp temp;
data _null_;
infile datalines;
file temp;
input;
put _infile_;
datalines;
annee manufacturier modele categorie cylindree cylindres transmission ville
2016 Ford Focus 1 1.8 5 Manual 10.1
2016 Toyota Tercel 3 1.4 3 Auto 7.1
run;
data want;
infile temp firstobs=2;
length
annee 8
manufacturier $20
modele $20
categorie 8
cylindree 8
cylindres 8
transmission $20
ville 8
;
informat
cylindree 8.1
ville 8.1
;
input
annee
manufacturier
modele
categorie
cylindree
cylindres
transmission
ville
;
run;
If your data contained spaces, for example manufacturier = Mercedes Benz, then you would need to use an informat (eg. $char20.) for that column as well.

What am I doing wrong while importing the following data into sas

I am trying to import certain data into my SAS dataset using this piece of code:
Data Names_And_More;
Infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt';
Input Name & $20.
Phone : $20.
Height & $10.
Mixed & $10.;
run;
The data in the file is as below:
Roger Cody (908)782-1234 5ft. 10in. 50 1/8
Thomas Jefferson (315)848-8484 6ft. 1in. 23 1/2
Marco Polo (800)123-4567 5Ft. 6in. 40
Brian Watson (518)355-1766 5ft. 10in 89 3/4
Michael DeMarco (445)232-2233 6ft. 76 1/3
I have been trying to learn SAS, and while going through Ron Cody's book Learning SAS by Example, I found that to import the kind of data above we can use 'the ampersand (&) informat modifier. The ampersand, like the colon, says to use the supplied informat, but the delimiter is now two or more blanks instead of just one.' (Ron's words, not mine). However, when I import the data, the resulting dataset is as follows:
Name Phone Height Mixed
Roger Cody (908)782- Thomas Jefferson Marco Polo
Also, for further details the SAS log is as follows:
419 Data Names_And_More;
420 Infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt';
421 Input Name & $20.
422 Phone : $20.
423 Height & $10.
424 Mixed & $10.
425 ;run;
NOTE:
The infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt' is:
File Name=C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3 Portable\Names_and_More.txt,
RECFM=V,LRECL=256
NOTE:
LOST CARD.
Name=Brian Watson (518)35 Phone=Michael Height=DeMarco (4 Mixed= _ERROR_=1 _N_=2
NOTE: 5 records were read from the infile 'C:\Users\Admin\Desktop\Torrent Downloads\SAS 9.1.3
Portable\Names_and_More.txt'.
The minimum record length was 37.
The maximum record length was 47.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.NAMES_AND_MORE has 1 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.14 seconds
I am looking for some help with this one. It'd be great if someone can explain what exactly is happening, what am I doing wrong and how to correct this error.
Thanks
The answer is in the explanation in Ron Cody's book. & means you need two spaces to separate variables, so you need a second space after the name (and after the other fields read with &).
Wrong:
Roger Cody (908)782-1234 5ft. 10in. 50 1/8
Right (two spaces after the name and after the height):
Roger Cody  (908)782-1234 5ft. 10in.  50 1/8
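A minimal sketch of the same idea with in-stream data, assuming the rows are retyped with two blanks after every field (which satisfies both the & and the : modifiers). The data set name and the TRUNCOVER option are additions for the demo, not part of the original code.

```sas
data names_and_more;
  infile datalines truncover;
  /* & : a value may contain single blanks; it ends at two or more blanks */
  /* : : a value ends at the first blank                                  */
  input Name   & $20.
        Phone  : $20.
        Height & $10.
        Mixed  & $10.;
datalines;
Roger Cody  (908)782-1234  5ft. 10in.  50 1/8
Thomas Jefferson  (315)848-8484  6ft. 1in.  23 1/2
Michael DeMarco  (445)232-2233  6ft.  76 1/3
;
run;

proc print data=names_and_more;
run;
```

Because the height "6ft." in the last row is followed by two blanks, the & modifier stops there instead of absorbing "76" into Height.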