How does SAS decide which file to select when using wildcards - import

I have a SAS code that looks something like this:
DATA WORK.MY_IMPORT_&stamp;
INFILE "M:\YPATH\myfile_150*.csv"
delimiter = ';' MISSOVER DSD lrecl = 1000000 firstobs = 2 ignoredoseof;
[...]
RUN;
Now, at M:\YPATH I have several files named myfile_150.YYYYMMDD. The code works as intended and always imports the latest file. I am wondering how SAS decides which file to choose, since the wildcard * can match anything. Does it sort the files in descending order and choose the first one?

On my system, SAS 9.4 TS1M4, SAS is reading ALL files that satisfy the wildcard.
I created 3 files (file_A.csv, file_B.csv, and file_C.csv). Each contains 1 record ('A', 'B', and 'C' respectively).
data test;
infile "c:\temp\file_*.csv"
delimiter = ';' MISSOVER DSD lrecl = 1000000 ignoredoseof;
format char $1.;
input char $;
run;
(Note I dropped the firstobs option from your code.)
The resulting TEST data set contains 3 observations, 'A', 'B', and 'C'.
This is the order of files returned when issuing
dir c:\temp\file_*.csv
SAS is using the default behavior of the OS and reading the files in that order.
25 data test;
26 infile "c:\temp\file_*.csv"
27 delimiter = ';' MISSOVER DSD lrecl = 1000000 ignoredoseof;
28 format char $1.;
29 input char $;
30 run;
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_A.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_B.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_C.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: The data set WORK.TEST has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
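Since every matching file is read, it can help to record which file each observation came from. The following is a minimal sketch, assuming the same three test files as above; the variable names fname and source_file are introduced here for illustration and are not part of the original code. The FILENAME= option on INFILE stores the path of the file currently being read in a variable that is not written to the output data set, so the sketch copies it to a second variable.

```sas
data test;
  /* fname and source_file are hypothetical names chosen for this sketch */
  length fname source_file $260;
  infile "c:\temp\file_*.csv"
    delimiter = ';' MISSOVER DSD lrecl = 1000000 ignoredoseof
    filename = fname;
  format char $1.;
  input char $;
  /* copy the value, because the FILENAME= variable itself is not kept */
  source_file = fname;
run;
```

With the source file recorded, one could filter on source_file to keep only the rows from the most recent file, or build the exact file name beforehand instead of relying on a wildcard.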

SAS proc import guessingrows issue

I'm trying to import a CSV file into SAS using PROC IMPORT. I know that the GUESSINGROWS option controls how many rows SAS scans to determine the type of each column. One of my CSV files has two columns that are entirely blank. Those columns should be numeric, but after running the code below they become character. Is there a way to make those two columns numeric, either during or after the import?
Here below is the code that I run:
proc import datafile="filepath\datasetA.csv"
out=dataA
dbms=csv
replace;
getnames=yes;
delimiter=",";
guessingrows=100;
run;
Thank you !
Modifying @Richard's code, I would do:
filename csv 'c:\tmp\abc.csv';
data _null_;
file csv;
put 'a,b,c,d';
put '1,2,,';
put '2,3,,';
put '3,4,,';
run;
proc import datafile=csv dbms=csv replace out=have;
getnames=yes;
run;
Go to the LOG window and see SAS code produced by PROC IMPORT:
data WORK.HAVE ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat a best32. ;
informat b best32. ;
informat c $1. ;
informat d $1. ;
format a best12. ;
format b best12. ;
format c $1. ;
format d $1. ;
input
a
b
c $
d $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Run this code and you will see that the last two columns are imported as character.
Check it:
ods select Variables;
proc contents data=have nodetails;run;
It is possible to modify this code and load the required columns as numeric. I would not drop and re-add the columns in SQL, because these columns could contain data in some rows.
Modified import code:
data WORK.HAVE ;
%let _EFIERR_ = 0; /* set the ERROR detection macro variable */
infile CSV delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
informat a best32. ;
informat b best32. ;
informat c best32. ;
informat d best32. ;
format a best12. ;
format b best12. ;
format c best12. ;
format d best12. ;
input
a
b
c
d
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
Check table description:
ods select Variables;
proc contents data=have nodetails;run;
You can change the type of a column whose values are all missing by dropping it and adding it back as the other type.
Example (SQL):
filename csv 'c:\temp\abc.csv';
data _null_;
file csv;
put 'a,b,c,d';
put '1,2,,';
put '2,3,,';
put '3,4,,';
run;
proc import datafile=csv dbms=csv replace out=have;
getnames=yes;
run;
proc sql;
alter table have
drop c, d
add c num, d num
;
quit;
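If the columns might hold values in some other file, converting them in a DATA step keeps whatever data is present. This is a minimal sketch, assuming the HAVE data set produced by the PROC IMPORT step above, with c and d imported as character; the names want, _c and _d are illustrative only.

```sas
data want;
  /* rename the character versions out of the way (hypothetical names _c, _d) */
  set have(rename=(c=_c d=_d));
  /* ?? suppresses invalid-data messages; unreadable values become missing */
  c = input(_c, ?? best32.);
  d = input(_d, ?? best32.);
  drop _c _d;
run;
```

Note that c and d end up at the end of the variable order, as they do with the ALTER TABLE approach.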

How to ignore errors, but not skip lines in COPY command redshift?

I have the COPY statement below. It skips lines, up to the maxerror limit. Is there any way to COPY data over to Redshift, forcing any erroring values into the column regardless of type? I don't want to lose information.
sql_prelim = """copy table1 from 's3://dwreplicatelanding/file.csv.gz'
access_key_id 'xxxx'
secret_access_key 'xxxx'
DELIMITER '\t' timeformat 'auto'
GZIP IGNOREHEADER 1
trimblanks
CSV
BLANKSASNULL
maxerror as 100000
"""
The error I want to skip is below, but ideally I want to skip all errors and maintain data:
1207- Invalid digit, Value 'N', Pos 0, Type: Decimal

Macro increment

I have table lookup values as below
sno date
1 200101
2 200102
3 200103
4 200104
I wrote below macro
%let date=200102;
proc sql;
select sno into :no from lookup where date=&date.;
quit;
I need help converting the whole lookup table into a macro that increments: create the first s.no and date as two macro variables, then increment from there, so that I don't need to update the dates in my lookup table every time. For example, if I look up date 201304, I need to get its corresponding s.no.
Is there a pattern to the SNO values? Are you basically numbering the months since 01JAN2001? If so, then use the INTCK() function.
data test;
input date yymmdd8. ;
format date yymmdd10. ;
sno = 1+intck('month','01JAN2001'd,date);
cards;
20010112
20010213
20010314
20010415
;
So you could create two macro variables. One with the base date and the other with the base SNO value.
36 %let basedate='01JAN2001'd ;
37 %let basesno=1;
38 %let date='01JAN2001'd ;
39 %let sno=%eval(&basesno + %sysfunc(intck(month,&basedate,&date)));
40 %put &=date &=sno;
DATE='01JAN2001'd SNO=1
41
42 %let date="%sysfunc(today(),date9)"d;
43 %let sno=%eval(&basesno + %sysfunc(intck(month,&basedate,&date)));
44 %put &=date &=sno;
DATE="16NOV2017"d SNO=203
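The lookup in the question stores dates as YYYYMM values such as 201304. Here is a sketch of the same calculation starting from such a value, assuming the SNO values simply number the months from January 2001; the macro variable yyyymm is a name introduced here for illustration.

```sas
%let basedate = '01JAN2001'd;
%let basesno  = 1;
%let yyyymm   = 201304;   /* hypothetical input in YYYYMM form */

/* read the YYYYMM value as a SAS date (first day of that month) */
%let date = %sysfunc(inputn(&yyyymm, yymmn6.));

/* count month boundaries from the base date and add the base SNO */
%let sno = %eval(&basesno + %sysfunc(intck(month, &basedate, &date)));
%put &=yyyymm &=sno;
```

For 201304 this should give SNO=148, since April 2013 is 147 months after January 2001.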
If you simply want to translate one (unique) value into another, you can use (in)formats. They can do much more than change how data are read or displayed. They are easy to use, fast (in-memory), and don't depend on the table once created. Change the library to a permanent one if WORK (a temporary library) doesn't suit your needs.
options fmtsearch=(formats,work);
data fmt(keep = fmtname type start end label hlo default);
length fmtname $10 type $1 start end $6 label 8 hlo $1 default 8;
fmtname = 'date_to_no';
type = 'I';
label=0;
do y = 2001 to 2099;
do m = 1 to 12;
start = put(y,4.) || put(m,z2.);
end = start;
label + 1;
default=50; /*default length of the string compared when informat is used. Should be higher than both start and end*/
output;
end;
end;
/*if you want to assign a value (=label) to inputs not found. In this case it's -2*/
hlo="O";
start = "";
end = start;
label= -2;
output;
run;
proc format library=work cntlin=fmt;
run;
data test;
no = input('200101',date_to_no.); output;
no = input('201710',date_to_no.); output;
no = input('201713',date_to_no.); output;
run;
Build a lookup table dynamically and create a macro variable for each row in the table. The macro variables will be named date_200101,date_200102,...and so on. They will contain a value equal to the corresponding sno value:
data lookup;
length var_name $20;
do sno = 1 to intck('month','01jan2001'd,date())+1;
date = input(put(intnx('month','01jan2001'd, sno-1, 'beginning'),yymmn6.),best.);
var_name = cats('date_',date);
call symput(var_name, cats(sno));
output;
end;
run;
You can then refer to the macro variables like so:
%let date =200103;
%put &&date_&date;
...or...
%put &date_200101;
The first usage example relies on double macro resolution. The macro processor needs two passes over the token &&date_&date to resolve it fully. On the first pass, it resolves to &date_200103. On the second pass, &date_200103 resolves to 3.

Replacing Turkish characters with English characters

I have a table with 120 columns, and some of them contain Turkish characters (for example "ç","ğ","ı","ö"). I want to replace these Turkish characters with English characters (for example "c","g","i","o"). Using the TRANWRD function would be really hard, because I would have to write the function 120 times, and the column names can sometimes change, so I would always have to check the code line by line.
Is there a simple macro that replaces these characters in all columns?
EDIT
In retrospect, this is an overly complicated solution... The translate() function should be used, as pointed out by another user. It could be integrated in a SAS function defined with PROC FCMP when used repeatedly.
A combination of regular expressions and a DO loop can achieve that.
Step 1: Build a conversion table in the following manner
Accented letters that resolve to the same replacement character are put on a single line, separated by the | symbol.
data conversions;
infile datalines dsd;
input orig $ repl $;
datalines;
ç,c
ğ,g
ı,i
ö|ò|ó,o
ë|è,e
;
Step 2: Store original and replacement strings in macro variables
proc sql noprint;
select orig, repl, count(*)
into :orig separated by ";",
:repl separated by ";",
:nrepl
from conversions;
quit;
Step 3: Do the actual conversion
Just to show how it works, let's deal with just one column.
data convert(drop=i re);
myString = "ç ğı òö ë, è";
do i = 1 to &nrepl;
re = prxparse("s/" || scan("&orig",i,";") || "/" || scan("&repl",i,";") || "/");
myString = prxchange(re,-1,myString);
end;
run;
Resulting myString: "c gi oo e, e"
To process all character columns, we use an array
Say your table is named mySource and you want all character variables to be processed; we'll create an array called cols for that.
data convert(drop=c i re);
set mySource;
array cols(*) _character_;
do c = 1 to dim(cols);
do i = 1 to &nrepl;
re = prxparse("s/" || scan("&orig",i,";") || "/" || scan("&repl",i,";") || "/");
cols(c) = prxchange(re,-1,cols(c));
end;
end;
run;
When changing single characters, TRANSLATE is the proper function. It takes one line of code.
translated = translate(string,"cgio","çğıö");
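To apply that one-liner to every character column without naming them, it can be combined with a _character_ array. This is a minimal sketch, assuming a source table called mySource (the name used in the other answer) and a single-byte session encoding in which each Turkish letter occupies one byte; the names want and _c are illustrative. In a UTF-8 session TRANSLATE works on bytes, so a multibyte-aware function such as KTRANSLATE would probably be needed instead.

```sas
data want;
  set mySource;
  /* every character variable in the table, whatever its name */
  array cols(*) _character_;
  do _c = 1 to dim(cols);
    /* the second argument lists replacements, the third the characters to replace */
    cols(_c) = translate(cols(_c), "cgio", "çğıö");
  end;
  drop _c;
run;
```

Because the array is built from _character_, the step keeps working when column names change.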
First get all your columns from the dictionary tables, and then replace the values in all of them in a macro %do loop.
You can try a program like this (replace MYTABLE with your table name):
proc sql;
/* only character columns need the replacement */
select name , count(*) into :columns separated by ' ', :count
from dictionary.columns
where memname = 'MYTABLE' and type = 'char';
quit;
%macro m;
data mytable;
set mytable;
%do i=1 %to &count;
%scan(&columns ,&i) = tranwrd(%scan(&columns ,&i),"ç","c");
%scan(&columns ,&i) = tranwrd(%scan(&columns ,&i),"ğ","g");
...
%end;
run;
%mend;
%m;

How to calculate conditional cumulative sum

I have a dataset like the one below, and I am trying to take a running total of events 2 and 3, with a slight twist. I only want to count these events when the Event_1_dt is less than the date in the current record. I'm currently using a macro %do loop to iterate through each record for that item type. While this produces the desired results, performance is slower than desirable. Each Item_Type may have up to 1250 records, and there are a couple thousand types. Is it possible to exit the loop before it cycles through all 1250 iterations? I am hesitant to try joins because there are some 30+ events to count up, but I'm open to suggestions. An additional complication is that even though Event_1_dt is always greater than Date, it does not have any other limitations.
Item_Type  Date      Event_1_dt  Event_2_flg  Event_3_flg  Desired_Event_2_Cnt  Desired_Event_3_Cnt
A          1/1/2014  1/2/2014    1            1            0                    0
A          1/2/2014  1/2/2014    0            1            0                    0
A          1/3/2014  1/8/2014    1            0            1                    2
B          1/1/2014  1/2/2014    1            0            0                    0
B          1/2/2014  1/5/2014    1            0            0                    0
B          1/3/2014  1/4/2014    1            1            1                    0
B          1/4/2014  1/5/2014    0            1            1                    0
B          1/5/2014  .           1            1            2                    1
B          1/6/2014  1/7/2014    1            1            3                    2
Corresponding Code:
%macro History;
data y;
set x;
Event_2_Cnt = 0;
Event_3_Cnt = 0;
%do i = 1 %to 1250;
lag_Item_Type = lag&i(Item_Type);
lag_Event_2_flg = lag&i(Event_2_flg);
lag_Event_3_flg = lag&i(Event_3_flg);
lag_Event_1_dt = lag&i(Event_1_dt);
if Item_Type = lag_Item_Type and lag_Event_1_dt > . and lag_Event_1_dt < Date then do;
if lag_Event_2_flg = 1 then do;
Event_2_Cnt = Event_2_cnt + 1;
end;
if lag_Event_3_flg = 1 then do;
Event_3_Cnt = Event_3_cnt + 1;
end;
end;
%end;
run;
%mend;
%History;
Well, that's not a trivial task for SAS, but it can still be solved in one DATA step, without merging, by using hash objects. The idea is as follows.
Within each item type, going record by record, we 'collect' event flags into 'bins' in a hash object, where each bin is a certain date. All bins are ordered by date in ascending order. At the same time, we insert the Date of the current record into the same hash (at its place in date order) and then iterate 'up' from that position, summing all the bins gathered so far (these have dates earlier than the date of the current record, since we are moving up).
Here's the code:
data have;
informat Item_Type $1. Date Event_1_dt mmddyy9. Event_2_flg Event_3_flg 8.;
infile datalines dsd dlm=',';
format Date Event_1_dt date9.;
input Item_Type Date Event_1_dt Event_2_flg Event_3_flg;
datalines;
A,1/1/2014,1/2/2014,1,1
A,1/2/2014,1/2/2014,0,1
A,1/3/2014,1/8/2014,1,0
B,1/1/2014,1/2/2014,1,0
B,1/2/2014,1/5/2014,1,0
B,1/3/2014,1/4/2014,1,1
B,1/4/2014,1/5/2014,0,1
B,1/5/2014,,1,1
B,1/6/2014,1/7/2014,1,1
;
run;
proc sort data=have; by Item_Type; run;
data want;
set have;
by Item_Type;
if _N_=1 then do;
declare hash h(ordered:'a');
h.defineKey('Event_date','type');
h.defineData('event2_cnt','event3_cnt');
h.defineDone();
declare hiter hi('h');
end;
/*for each new Item_type we clear the hash completely*/
if FIRST.Item_Type then h.clear();
/*now if date of Event 1 exists we put it into corresponding */
/* (by date) place of our ordered hash. If such date is already*/
/*in the hash, we increase number of events for this date */
/*adding values of Event2 and Event3 flags. If no - just assign*/
/*current values of these flags.*/
if not missing(Event_1_dt) then do;
Event_date=Event_1_dt;type=1;
rc=h.find();
event2_cnt=coalesce(event2_cnt,0)+Event_2_flg;
event3_cnt=coalesce(event3_cnt,0)+Event_3_flg;
h.replace();
end;
/*now we insert Date of the record into the same ordered hash, */
/*making type=0 to differ this item from items where date means*/
/*date of Event1 (not date of record)*/
Event_date=Date;
event2_cnt=0; event3_cnt=0; type=0;
h.replace();
Desired_Event_2_Cnt=0;
Desired_Event_3_Cnt=0;
/*now we iterate 'up' from just inserted item, i.e. looping */
/*through all items that have date < the date of the record. */
/*Items with date = the date of the record will be 'below' since*/
/*they have type=1 and our hash is ordered by dates first, and */
/*types afterwards (1's will be below 0's)*/
hi.setcur(key:Date,key:0);
rc=hi.prev();
do while(rc=0);
Desired_Event_2_Cnt+event2_cnt;
Desired_Event_3_Cnt+event3_cnt;
rc=hi.prev();
end;
drop Event_date type rc event2_cnt event3_cnt;
run;
I can't test it with your real number of rows, but I believe it should be pretty fast, since we loop only through a small hash object, which is entirely in memory, and we do only as many loops for each record as necessary (only earlier events) and don't do any IF-checks.
I don't think a hash is necessary for this. It seems like a simple DATA step will do the trick. This might save you (or the next programmer who comes across your code) from needing to re-read and do research in order to understand it.
I think the following will work:
data have;
informat Item_Type $1. Date Event_1_dt mmddyy9. Event_2_flg Event_3_flg 8.;
infile datalines dsd dlm=',';
format Date Event_1_dt date9.;
input Item_Type Date Event_1_dt Event_2_flg Event_3_flg;
datalines;
A,1/1/2014,1/2/2014,1,1
A,1/2/2014,1/2/2014,0,1
A,1/3/2014,1/8/2014,1,0
B,1/1/2014,1/2/2014,1,0
B,1/2/2014,1/5/2014,1,0
B,1/3/2014,1/4/2014,1,1
B,1/4/2014,1/5/2014,0,1
B,1/5/2014,,1,1
B,1/6/2014,1/7/2014,1,1
;
data want2 (drop=_: );
set have;
by ITEM_Type;
length _Alldts_event2 _Alldts_event3 $20000;
retain _Alldts_event2 _Alldts_event3;
*Clear _ALLDTS for each ITEM_TYPE;
if first.ITEM_type then Do;
_Alldts_event2 = "";
_Alldts_event3 = "";
END;
*If event is flagged, concatenate the Event_1_dt to the ALLDTS variable;
if event_2_flg = 1 Then _Alldts_event2 = catx(" ", _Alldts_event2,Event_1_dt);
if event_3_flg = 1 Then _Alldts_event3 = catx(" ", _Alldts_event3,Event_1_dt);
_numWords2 = COUNTW(_Alldts_event2);
_numWords3 = COUNTW(_Alldts_event3);
*Loop through alldates, count the number that are < the current records date;
cnt2=0;
do _i = 1 to _NumWords2;
_tempDate = input(scan(_Alldts_event2,_i),Best12.);
if _tempDate < date Then cnt2=cnt2+1;
end;
cnt3=0;
do _i = 1 to _NumWords3;
_tempDate = input(scan(_Alldts_event3,_i),Best12.);
if _tempDate < date Then cnt3=cnt3+1;
end;
run;
I believe the Hash may be faster, but you'll have to decide on what tradeoff of comprehensibility/performance is appropriate.
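For cross-checking either approach on a small sample, the same counts can be written as a self-join in PROC SQL. This is only a sketch, not a recommendation for the full data: it assumes the HAVE data set defined above and that each Item_Type and Date combination occurs once. The table name want_sql and the aliases a and b are introduced here.

```sas
proc sql;
  create table want_sql as
  select a.Item_Type, a.Date, a.Event_1_dt, a.Event_2_flg, a.Event_3_flg,
         /* rows with no earlier event get 0 instead of missing */
         coalesce(sum(b.Event_2_flg), 0) as Desired_Event_2_Cnt,
         coalesce(sum(b.Event_3_flg), 0) as Desired_Event_3_Cnt
  from have as a
  left join have as b
    on  a.Item_Type = b.Item_Type
    /* missing dates sort lowest in SAS, so exclude them explicitly */
    and not missing(b.Event_1_dt)
    and b.Event_1_dt < a.Date
  group by a.Item_Type, a.Date, a.Event_1_dt, a.Event_2_flg, a.Event_3_flg
  order by a.Item_Type, a.Date;
quit;
```

With 30+ events the select list grows by one sum per flag, and the join compares every pair of rows within an item type, so the hash or DATA step versions are likely to scale better.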