Here's my code. I am unable to read the dates from the input; SAS keeps reporting an incorrect format. I tried changing the informat a few times, to mmddyy10., mmddyy8. and others, but the dates are still not read in correctly.
data master_patients;
infile datalines;
input account_number name $8-16 address $17-34 date MMDDYYYY10. gender $1.
insurance_code $49-51 updated_date mmddyyyy10.;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
Could you please point out where I am going wrong? Thanks for any help.
I recommend a specific informat rather than ANYDTDTE, although ANYDTDTE can help you get started. A specific informat ensures that the dates are interpreted the way you intend.
data master_patients;
infile datalines;
informat date updated_date mmddyy10.;
format date updated_date date9.;
input account_number name $ 8-16 address $ 17-34 date gender $1.
insurance_code $ 49-51 updated_date;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
run;
There are two main problems. First, the informat name has two Y's in it, not four: it is MMDDYY10., not MMDDYYYY10. Second, the column pointer is not in the right place when you read 10 characters as a date, so you get a blank followed by the first 9 characters of the date. That leaves a year such as 197, and SAS cannot represent dates in the second or third century AD. Try MDY(12,21,197) and see what happens.
data master_patients;
infile datalines firstobs=2;
input account_number name $8-16 address $17-34 @36 date MMDDYY10.
gender $1. insurance_code $49-51 @53 updated_date mmddyy10.
;
datalines;
----+----1----+----2----+----3----+----4----+----5----+----6----+
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
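The off-by-one column effect described in this answer can be illustrated outside SAS. The following is a minimal Python sketch, not SAS code. It assumes the original fixed-width layout (name in columns 8-16, address in columns 17-34, a blank in column 35 and the date in columns 36-45), and the helper name `columns` is hypothetical.

```python
# Rebuild one record under the ASSUMED fixed-width layout.
record = (
    "620135".ljust(7)              # columns 1-7: account number plus a blank
    + "Smith".ljust(9)             # columns 8-16: name
    + "234 Aspen St.".ljust(18)    # columns 17-34: address
    + " "                          # column 35: blank
    + "12-21-1975"                 # columns 36-45: date
)


def columns(line, start, width):
    """Return `width` characters starting at 1-based column `start`."""
    return line[start - 1:start - 1 + width]


# Reading 10 characters right after the address starts one column too early.
wrong = columns(record, 35, 10)
# Moving the pointer to column 36 first picks up the whole date.
right = columns(record, 36, 10)

print(repr(wrong))
print(repr(right))
# The truncated value ends in a three-digit "year".
print(wrong.strip().split("-")[2])
```

The first slice shows why the date ends up with a three-digit year when the pointer is left at column 35.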
Use modified list input for this problem: just add ":" between the variable name and the informat.
data master_patients;
infile datalines;
input account_number name $8-16 address $17-34 date : mmddyy10. gender $1.
insurance_code $49-51 updated_date : mmddyy10.;
datalines;
620135 Smith 234 Aspen St. 12-21-1975 m CBC 02-16-1998
645722 Miyamoto 65 3rd Ave. 04-03-1936 f MCR 05-30-1999
645739 Jensvold 505 Glendale Ave. 06-15-1960 f HLT 09-23-1993
874329 Kazoyan 76-C La Vista . . MCD 01-15-2003
;
proc print data=master_patients;
run;
Please note that if you leave out the ":" and just change mmddyy10. to anydtdte., the values read into the data set may not be correct.
I am quite new to Unix and am currently trying to extract data sets from a text file. I tried sed, grep and awk, but they only seem to work for extracting individual lines, whereas I want to extract an entire data set. Here is an example of a file from which I'd like to extract the 2 data sets (the figures after the lines starting with "R.Time"):
[Header]
Application Name LabSolutions
Version 5.87
Data File Name C:\LabSolutions\Data\Antoine\170921_AC_FluoSpectra\069_WT3a derivatized lignin LiCl 430_GPC_FOREVER_430_049.lcd
Output Date 2017-10-12
Output Time 12:07:32
[Configuration]
Instrument Name BOTAN127-Instrument1
Instrument # 1
Line # 1
# of Detectors 3
Detector ID Detector A Detector B PDA
Detector Name Detector A Detector B PDA
# of Channels 1 1 2
[LC Chromatogram(Detector A-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
Ex. Wavelength(nm) 405
Em. Wavelength(nm) 430
R.Time (min) Intensity
0,00000 -709779
0,00833 -709779
0,01667 17
0,02500 3
0,03333 7
0,04167 19
0,05000 9
0,05833 5
0,06667 2
0,07500 24
0,08333 48
[LC Chromatogram(Detector B-Ch1)]
Interval(msec) 500
# of Points 9603
Start Time(min) 0,000
End Time(min) 80,017
Intensity Units mV
Intensity Multiplier 0,001
R.Time (min) Intensity
0,00000 149
0,00833 149
0,01667 -1
I would greatly appreciate any idea. Thanks in advance.
Antoine
awk '/^[^0-9]/&&d{d=0} /R.Time/{d=1}d' file
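Brief explanation:
d is a flag that determines whether a line is printed.
/^[^0-9]/&&d{d=0}: if the line starts with a non-digit character and d is set, clear d.
/R.Time/{d=1}: if the line matches "R.Time", set d.
The trailing d is a pattern with no action, so the line is printed whenever d is nonzero.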
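For readers less familiar with awk, the same flag logic can be written out step by step. This is only a Python sketch of the one-liner above, with a hypothetical function name `extract_blocks`. Like the awk command, it keeps the "R.Time" header line of each block.

```python
import re


def extract_blocks(lines):
    """Keep each "R.Time" line and the digit-led lines that follow it."""
    printing = False              # plays the role of the awk flag d
    kept = []
    for line in lines:
        # /^[^0-9]/ && d {d=0}: a non-digit first character ends a block
        if printing and re.match(r"[^0-9]", line):
            printing = False
        # /R.Time/ {d=1}: a header line starts a block
        if re.search(r"R.Time", line):
            printing = True
        # trailing d: print while the flag is set
        if printing:
            kept.append(line)
    return kept


sample = [
    "[LC Chromatogram(Detector A-Ch1)]",
    "Interval(msec) 500",
    "R.Time (min) Intensity",
    "0,00000 -709779",
    "0,00833 -709779",
    "[LC Chromatogram(Detector B-Ch1)]",
    "# of Points 9603",
    "R.Time (min) Intensity",
    "0,00000 149",
]

for line in extract_blocks(sample):
    print(line)
```

The order of the two tests matters: the reset is checked before the header match, so the "R.Time" line itself (which starts with a non-digit) switches printing back on.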
awk '/R.Time/,/LC/' file|grep -v -E "R.Time|LC"
The grep part removes the R.Time and LC lines that are included in the output from awk.
I think it's a job for sed.
sed '/R.Time/!d;:A;N;/\n$/!bA' infile
I am trying to generate a new variable using the stored result r(mean) from the command sum.
I have a continuous variable 'age'.
So,
sum age
g age0=age–r(mean)
The problem is that this gives the error
unknown function age–r()
r(133);.
Works for me:
. sysuse auto, clear
(1978 Automobile Data)
. su mpg
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
mpg | 74 21.2973 5.785503 12 41
. gen mpg0 = mpg - r(mean)
. su mpg0
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
mpg0 | 74 -4.03e-08 5.785503 -9.297297 19.7027
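Watch for strange characters, i.e. check that the minus sign really is a plain hyphen (-). The expression in your question contains an en dash (–) between age and r(mean), which would explain why Stata reports an unknown function age–r() instead of performing a subtraction.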
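One way to spot such characters before pasting a command is to list everything outside plain ASCII. This is a small Python sketch, not part of Stata, and the function name `suspicious_characters` is hypothetical.

```python
import unicodedata


def suspicious_characters(command):
    """Return (position, character, Unicode name) for each non-ASCII character."""
    return [
        (index, char, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(command)
        if ord(char) > 127
    ]


# The command from the question, with the en dash written as an escape
# so that it is visible in the source.
pasted = "g age0=age\u2013r(mean)"
typed = "gen mpg0 = mpg - r(mean)"

print(suspicious_characters(pasted))
print(suspicious_characters(typed))
```

An empty list means the command contains only plain ASCII characters.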
I want to replace all instances of "a number followed by any number of spaces followed by a period and possibly more spaces" with the number and period only.
For example, '14 . x' will become '14.x'.
My test data is:
1. c4 e5 2. g3 c6 { good move. } 3. Bg2 Nf6 4. Nc3 $6 d5 5. cxd5 cxd5 6. Qb3 Nc6 $1.. Nxd5 Nd4
8. Nxf6+ Qxf6 9. Qd1.f5 10. d3 Rc8 (10... Bb4+ $5 11. Bd2 Bxd2+ 12. Qxd2 Qa6 $1.3. Rc1.xa2
14. Bxb7 $2 Rb8 15. Qb4 Bd7) 11. Kf1.c5 12. Nf3 O-O
How can I do that?
If you want any number of spaces removed from either side of the period, you should try s/\([0-9]\) *\. */\1./g:
$ echo '11. A 12 .B 13 . C 14.D 15 . E' | sed 's/\([0-9]\) *\. */\1./g'
11.A 12.B 13.C 14.D 15.E
For your test data, the results are:
1.c4 e5 2.g3 c6 { good move. } 3.Bg2 Nf6 4.Nc3 $6 d5 5.cxd5 cxd5 6.Qb3 Nc6 $1.. Nxd5 Nd4
8.Nxf6+ Qxf6 9.Qd1.f5 10.d3 Rc8 (10... Bb4+ $5 11.Bd2 Bxd2+ 12.Qxd2 Qa6 $1.3.Rc1.xa2
14.Bxb7 $2 Rb8 15.Qb4 Bd7) 11.Kf1.c5 12.Nf3 O-O
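The same substitution can be tried in other regex engines. The sketch below uses Python's `re.sub` with the equivalent pattern; note that Python writes the capture group without backslashes, and the function name `tighten_periods` is hypothetical.

```python
import re


def tighten_periods(text):
    """Remove spaces around a period that directly follows a digit."""
    # ([0-9]) captures the digit; " *" on each side swallows any spaces.
    return re.sub(r"([0-9]) *\. *", r"\1.", text)


print(tighten_periods("11. A 12 .B 13 . C 14.D 15 . E"))
print(tighten_periods("14 . x"))
```

As with the sed version, a period that does not follow a digit (such as the one in "good move. }") is left untouched.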
I am currently working with a data set of 760 metabolites. These metabolites were provided to 15 bacterial species. Further, their growth was monitored at 2 optical densities (OD), in triplicate. Therefore, I have a data set with 1520 rows and 17 columns.
options formdlim='.';
data Vera;
input Metabolite$ OD P1 P2 P3 P4 P5 P6 P7 C8 C9 C10 B11 B12 B13 B14 B15
;
cards;
proc print; run;
What I want to do is to find out whether the data of the 2 optical densities are the same (no significant difference between the data at 492 and 630 for each metabolite). Therefore, I wrote the following code:
PROC mixed DATA=Vera;
CLASS OD ;
MODEL P1 = OD / SOLUTION ;
lsmeans OD/ diff;
RUN;
With this you can analyse the differences between the ODs at 492 and 630 for each of the bacterial species. However, you would have to separate the data and run each metabolite separately. That would not be a problem with a small number of variables, but I have 760, and I don't want to repeat the procedure and enter the code manually 760 times. I want to write a macro in SAS that repeats the same syntax for each of the metabolites. How can I do that?
My data looks like this:
Metabolite OD P1 P2 P3 P4 P5 P6 P7 C8 C9 C10 B11 B12 B13 B14 B15
C1 492 0.80318008 0.834511094 0.755462174 0.947215787 0.887920107 0.941135272 0.854403285 0.827162124 0.774818623 1.043873527 0.980611933 0.99175232 0.899985465 2.323935576 0.989680725
C1 492 1.015295591 0.937931127 0.862409875 1.035489644 1.020100969 1.432972263 1.20098598 1.014347313 1.024901914 1.350518389 1.228546301 1.058456868 1.021602321 0.882652756 1.068231275
C1 492 0.810476853 0.767190317 0.566538969 1.160767653 1.036374265 1.007790833 1.190486783 1.113972414 0.325186332 0.907718954 1.675218213 0.906072763 1.410147143 1.060946843 1.067602052
C1 630 0.961524961 1.005846657 0.847824375 1.025462906 0.976906071 0.976627864 1.01474825 0.903212955 0.934967536 0.882814468 1.001740347 0.903248894 0.996416257 1.02681187 0.916566129
C1 630 1.554650956 0.737506567 2.452827299 1.037786536 0.874060377 0.950382623 1.081525591 2.143129784 1.077641166 1.993884723 1.685291793 0.927601975 1.097186964 0.84841252 0.942020551
C1 630 3.397638555 3.48494389 2.736307131 4.485634181 4.927877673 4.754434301 5.041446678 3.008039216 1.24514729 3.849372819 3.335763153 4.537001962 4.347699905 2.650736885 5.007861571
C2 492 0.621121776 0.655197791 0.624464533 0.774748488 0.835036637 0.890241965 1.050214203 0.766379479 0.499753317 0.708279952 0.851083004 0.833468896 0.842360044 0.536406298 0.722104984
C2 492 1.75496053 1.625140448 1.234260466 1.600459563 1.805650674 3.902582698 4.366733197 4.3322092 0.884777351 3.659221055 3.698372956 4.424445968 3.911657965 1.184654064 3.032617686
C2 492 1.136163306 0.990741638 1.008046619 1.090941503 1.065424996 1.286243284 1.162517672 1.086776372 1.050708989 0.947436205 1.255244694 1.097283143 1.064965485 1.025620139 0.974254224
C2 630 1.113004223 1.481277257 1.117820203 1.606865598 1.547740666 1.923981394 1.79028251 1.600927099 0.651330519 1.688562315 1.671669463 1.596206391 1.999786168 1.112853138 1.95607287
C2 630 0.802575958 0.63027506 0.688188658 0.879770793 0.779821048 0.884177322 0.942509034 0.755849107 0.630951119 0.712527463 0.897567203 0.847457282 0.838313324 0.696858072 0.737402398
C2 630 3.868652818 3.623364192 2.899296194 4.850127834 5.171682933 5.239876518 5.407341626 3.381502495 1.345204779 4.170354345 3.676830466 4.893081332 4.646074976 2.792233812 5.15275719
I tried creating a macro but I think I am not getting what I would like to see. Again, I have 16 bacterial species, which were grown in presence of 760 metabolites. The growth was measured at 2 different OD. I want to find out whether the OD measurement for each metabolite on each species is significantly different. I modified the macro and the model as follows:
options formdlim='-';
data vera;
input Metabolite$ OD P1 P2 P3 P4 P5 P6 P7 C8 C9 C10 B11 B12 B13 B14 B15 B16;
cards;
%macro metabolites(varsel);
PROC mixed DATA=Vera;
CLASS OD ;
MODEL &varsel = OD / ddfm=kr ;
lsmeans OD/pdiff;
RUN;
%mend metabolites;
%metabolites (P1);
%metabolites (P2);
%metabolites (P3);
%metabolites (P4);
%metabolites (P5);
%metabolites (P6);
%metabolites (P7);
%metabolites (C8);
%metabolites (C9);
%metabolites (C10);
%metabolites (B11);
%metabolites (B12);
%metabolites (B13);
%metabolites (B14);
%metabolites (B15);
%metabolites (B16);
With this model I can see that all the metabolites measured in a particular species at a specific OD (492 nm, for example) are not significantly different from ALL the metabolites measured at 630 nm. Unfortunately, this has no biological relevance, and I still need to repeat the syntax every time I want to find significant differences between ODs for a particular metabolite in a particular species.
I also tried sorting out the data set with the "BY" statement but I did not get any different output. Is there anything else that I am missing?
Just specify a BY statement:
PROC mixed DATA=Vera;
CLASS OD ;
MODEL P1 = OD / SOLUTION ;
BY Metabolite;
lsmeans OD/ diff;
RUN;
Change your data structure so you can use the BY statement as mentioned above. I haven't tested it but something like the following should work:
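*flip your data;
data flipped;
set vera;
length BS $32; *holds the name of the species column;
array species(15) P1-P7 C8-C10 B11-B15; *the array name must differ from the variable BS;
do i = 1 to dim(species);
BS = vname(species(i)); *capture the name of the variable;
BS_Value = species(i);
output;
end;
run;
proc sort data=flipped;
by metabolite bs;
run;
PROC mixed DATA=flipped; *analyse the flipped data set, not the original;
By metabolite bs;
CLASS OD ;
MODEL BS_value = OD / SOLUTION ;
lsmeans OD/ diff;
RUN;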
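The reshaping idea is independent of SAS: each wide row becomes one long row per species column, and the analysis then runs once per (metabolite, species) group. Below is a minimal Python sketch of that restructuring only, not of the mixed model. The function names `flip` and `by_groups` are hypothetical, and the sample uses just the P1 and P2 values from the first C1 rows at each OD.

```python
from collections import defaultdict

ID_COLUMNS = ("Metabolite", "OD")


def flip(rows):
    """Turn each wide row into one long row per species column."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in ID_COLUMNS:
                continue
            long_rows.append({
                "Metabolite": row["Metabolite"],
                "OD": row["OD"],
                "BS": column,          # name of the species column
                "BS_Value": value,     # measurement for that species
            })
    return long_rows


def by_groups(long_rows):
    """Collect (OD, value) pairs per (metabolite, species), like a BY group."""
    groups = defaultdict(list)
    for r in long_rows:
        groups[(r["Metabolite"], r["BS"])].append((r["OD"], r["BS_Value"]))
    return dict(groups)


sample = [
    {"Metabolite": "C1", "OD": 492, "P1": 0.80318008, "P2": 0.834511094},
    {"Metabolite": "C1", "OD": 630, "P1": 0.961524961, "P2": 1.005846657},
]

for key, values in sorted(by_groups(flip(sample)).items()):
    print(key, values)
```

Each printed group corresponds to one BY group, that is, one metabolite in one species with its measurements at both ODs.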
I have the following sample data to read into SAS
2012-05-0317:36:00NYA
2012-05-0410:29:00SNW
2012-05-2418:45:00NYA
2012-05-2922:24:00NSL
2012-05-3107:26:00DEN
2012-05-2606:10:00PHX
2012-05-0202:30:00FTW
2012-05-0220:45:00HOB
2012-05-0103:01:00HGR
2012-05-0120:30:00RCH
2012-05-1112:00:00NAS
However, there is a strange problem bothering me.
Here is my first try.
data test;
informat DT yymmdd10.
TM $TIME8.
orig $3.
;
format DT yymmddd10.
TM TIME8.
orig $3.
;
input
@1 DT_temp
@11 TM_temp
@19 orig
;
datalines;
2012-05-0317:36:00NYA
2012-05-0410:29:00SNW
2012-05-2418:45:00NYA
2012-05-2922:24:00NSL
2012-05-3107:26:00DEN
2012-05-2606:10:00PHX
2012-05-0202:30:00FTW
2012-05-0220:45:00HOB
2012-05-0103:01:00HGR
2012-05-0120:30:00RCH
2012-05-1112:00:00NAS
run;
The result shows
DT TM orig
. . NYA
. . SNW
. . NYA
. . NSL
. . DEN
. . PHX
. . FTW
. . HOB
. . HGR
. . RCH
. . NAS
This means the date and time are not read correctly. A workaround I have right now is to read everything as a string first and then convert it to a date and a time respectively.
data test;
informat DT_temp $10.
TM_temp $8.
orig $3.
;
format DT yymmddd10.
TM TIME8.
orig $3.
;
input
@1 DT_temp
@11 TM_temp
@19 orig
;
DT=input(strip(DT_temp),yymmdd10.);
TM=input(strip(TM_temp),time8.);
drop DT_temp TM_temp;
datalines;
2012-05-0317:36:00NYA
2012-05-0410:29:00SNW
2012-05-2418:45:00NYA
2012-05-2922:24:00NSL
2012-05-3107:26:00DEN
2012-05-2606:10:00PHX
2012-05-0202:30:00FTW
2012-05-0220:45:00HOB
2012-05-0103:01:00HGR
2012-05-0120:30:00RCH
2012-05-1112:00:00NAS
run;
In this way, everything gets the correct format.
orig DT TM
NYA 2012-05-03 17:36:00
SNW 2012-05-04 10:29:00
NYA 2012-05-24 18:45:00
NSL 2012-05-29 22:24:00
DEN 2012-05-31 7:26:00
PHX 2012-05-26 6:10:00
FTW 2012-05-02 2:30:00
HOB 2012-05-02 20:45:00
HGR 2012-05-01 3:01:00
RCH 2012-05-01 20:30:00
NAS 2012-05-11 12:00:00
Basically, these two methods use the same informats, so I was wondering why the first method does not work. I would appreciate any help. Thank you very much.
Your "first try" code has a couple errors, but I'm guessing they were introduced while writing the question.
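Because you are positioning with column pointers and the fields are not separated by blanks, you need to specify the informat for each variable in the INPUT statement itself. Otherwise SAS uses list input, which reads up to the next blank rather than a fixed number of characters. Here is a corrected version: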
data test;
informat DT yymmdd10.
TM TIME8.
orig $3.
;
format DT yymmddd10.
TM TIME8.
orig $3.
;
input
@1 DT yymmdd10.
@11 TM TIME8.
@19 orig $3.
;
datalines;
2012-05-0317:36:00NYA
2012-05-0410:29:00SNW
2012-05-2418:45:00NYA
2012-05-2922:24:00NSL
2012-05-3107:26:00DEN
2012-05-2606:10:00PHX
2012-05-0202:30:00FTW
2012-05-0220:45:00HOB
2012-05-0103:01:00HGR
2012-05-0120:30:00RCH
2012-05-1112:00:00NAS
run;
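The fixed positions used in the corrected INPUT statement (10 characters from column 1, 8 from column 11, 3 from column 19) can be checked against a sample line outside SAS. This is a Python sketch of the same slicing, with a hypothetical function name `parse_record`.

```python
from datetime import datetime


def parse_record(line):
    """Split one fixed-width record into date, time and origin code."""
    # Columns 1-10: date in yyyy-mm-dd form
    dt = datetime.strptime(line[0:10], "%Y-%m-%d").date()
    # Columns 11-18: time in hh:mm:ss form
    tm = datetime.strptime(line[10:18], "%H:%M:%S").time()
    # Columns 19-21: three-letter origin code
    orig = line[18:21]
    return dt, tm, orig


dt, tm, orig = parse_record("2012-05-0317:36:00NYA")
print(dt, tm, orig)
```

Because the fields run together with no blanks between them, each one has to be cut out by position and width, which is what the informats on the INPUT statement do.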