SAS macro to read multiple raw data files and create a SAS dataset for each raw data file - data-management

Hi there,
My name is Chandra. I am not very good at SAS macros, especially the looping part and resolving &&, etc. Here is my problem statement.
Problem statement:
I have a large number of raw data files (.dat files) stored in a folder on a SAS server. I need a macro that reads each of these raw data files, creates a SAS dataset for each one, and stores the datasets in a separate target folder on the SAS server. All of the raw data files have the same file layout. I need to automate this so that every week the macro reads the raw data files from the source folder and creates the corresponding SAS datasets in the target folder. For example, if there are 200 raw data files in the source folder, I want to read them and create 200 SAS datasets, one per raw data file, and save them in the target folder. I am not very good at constructing looping statements or at resolving && or &&&, etc. How do I do it?
I would highly appreciate your kind assistance in this regard.
Respectfully
Chandra

You don't necessarily need a macro or a loop when the files all have the same fields. You can try the PIPE option with the FILENAME statement. Here is the link

You do not need a macro for this type of processing.
The INFILE statement will accept a file specification that includes operating system wildcards.
This example creates 200 text files in the work folder and then reads them back in in a single step.
I highly recommend not creating 200 separate data sets. Instead, keep the filename, or a unique portion of it, as a categorical variable that can be used later in a CLASS or BY statement, or as part of the criteria of a subsetting WHERE.
%let workpath = %sysfunc(pathname(WORK));

* create something to input;
data _null_;
   do i = 0 to 1999;
      if mod(i,10) = 0 then filename = cats("&workpath./",'sample',i/10,'.txt');
      file sample filevar=filename;
      x = i; y = x**2;
      put i x y;
   end;
run;

* input data from 200 different files that have the same layout;
data samples;
   length filename $250;
   infile "&workpath.\*.txt" filename=filename; %* <-- Here be the wildcards;
   input i x y;
   source = filename;
run;

Related

How to export statistics_log data cumulatively to the desired Excel sheet in AnyLogic?

I have selected to export tables at the end of model execution to an Excel file, and I would like that data to accumulate on the same Excel sheet after every stop and start of the model. As of now, every stop and start just exports that one run's data and overwrites what was there previously. I may be approaching the method of exporting multiple runs wrong or inefficiently, but I'm not sure.
The best method is to export the raw data, as you do (if it is not too large).
However, two improvements:
Manage your output data yourself, i.e. do not rely on the standard export tables but only write the data that you really need. Check this help article to learn how to write your own data.
In your custom output data tables, add additional identification columns such as date_of_run. I often use iteration and replication columns to also identify which of those the data stems from.
Custom CSV approach
An alternative approach is to create your own CSV file programmatically; this is possible with Java code. Then you can create a new one (with a custom filename) after any run:
First, define a “Text file” element as below:
Then, use this code below to create your own csv with a custom name and write to it:
File outputDirectory = new File("outputs");
outputDirectory.mkdir();
String outputFileNameWithExtension = outputDirectory.getPath() + File.separator + "output_file.csv";
file.setFile(outputFileNameWithExtension, Mode.WRITE_APPEND);

// create header
file.println("col_1" + "," + "col_2");

// write data from the dbase table
List<Tuple> rows = selectFrom(my_dbase_table).list();
for (Tuple row : rows) {
    file.println(row.get(my_dbase_table.col_1) + "," +
                 row.get(my_dbase_table.col_2));
}
file.close();

Azure Data Factory reading Blob folder dealing with random characters

I would like to define a dataset that has a file path of
awsomedata/1992-12-25{random characters}MT/users.json
I am unsure of how to use the expression language fully. I have figured out the following
#startsWith( pipeline().parameters.filepath(),concat('awsomedata/',formatDateTime(utcnow('d'), 'yyyy-MM-dd')), #pipeline().parameters.filePath)
The dataset will change dynamically; I am trying to tell it to look at the file on each trigger to determine the schema.

How to convert list of dictionaries into bytes stream and load it to the database

My standard way to run bulk uploads of CSV files to a Postgres database is to use the copy_expert() method:
cursor.copy_expert("copy %s from STDIN CSV HEADER NULL 'NULL' QUOTE '\"';" % (table_name), file=f)
Very often, before loading a file to the database, I run some pre-processing of the CSV file. The results are always kept in a list of dictionaries.
Following my standard way of loading, I have to offload the list of dictionaries to a temporary CSV file and only then give that file to copy_expert() as the source of data.
What I would like to do is replace the file source with a stream of bytes converted from the list of dictionaries, and then pass that to copy_expert() as the source. I would like to eliminate the step of writing a temporary CSV file and load the data straight from memory.
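A minimal sketch of that in-memory approach, assuming psycopg2 is the driver in use (the records list and the table name my_table below are placeholders): serialize the dictionaries into an io.StringIO buffer with csv.DictWriter and hand the buffer to copy_expert() as its file argument.

import csv
import io
import psycopg2

# Pre-processed data kept as a list of dictionaries (placeholder values).
records = [
    {"id": 1, "name": "alpha", "value": 3.14},
    {"id": 2, "name": "beta", "value": None},
]

# Serialize the dictionaries into an in-memory CSV "file".
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "value"])
writer.writeheader()
for rec in records:
    # Represent missing values the way the COPY statement expects them.
    writer.writerow({k: ("NULL" if v is None else v) for k, v in rec.items()})
buf.seek(0)  # rewind so COPY reads from the start of the buffer

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "copy my_table from STDIN CSV HEADER NULL 'NULL' QUOTE '\"';",
        file=buf,
    )

copy_expert() only needs a file-like object that supports read()/readline(), which io.StringIO provides, so no temporary file ever touches disk.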

ADF / Dataflow - Convert Multiple CSV to Parquet

In ADLS Gen2, the TextFiles folder has 3 CSV files. The column names are different in each file.
We need to convert all 3 CSV files to 3 Parquet files and put them in the ParquetFiles folder.
I tried to use the Copy activity and it fails because the column names contain spaces, which Parquet files don't allow.
To remove the spaces, I used a Data Flow: Source -> Select (replace space with underscore in column names) -> Sink. This worked for a single file. When I tried to do it for all 3 files, it tried to merge the 3 files and generated a single file with incorrect data.
How do I solve this, mainly removing spaces from column names in all files? What would the other options be here?
Pipeline: ForEach activity (loop over the CSV files in the folder and pass the current iteration item to the data flow as a parameter) -> Data Flow activity with a source that points to that folder (parameterize the file name in the source path).
I created 2 datasets, one in CSV with a wildcard format, the other in Parquet. I used the Copy Data activity with the Parquet dataset as sink and the CSV dataset as source. I set the copy behavior to Merge files.

Store .fmt file in SQL Server

Is it possible to store a .fmt file right in the database, like a stored procedure, rather than in a separate file?
The imported files vary; the format file is constant for the procedure. No BLOBs, no FILESTREAM used.
...
FROM OPENROWSET (
    BULK 'd:\path\some_variable_file.txt',
    FIRSTROW = 2,
    FORMATFILE = 'd:\path\importformat.fmt'
) AS import
OPENROWSET does not support any source besides the file system for FORMATFILE. One option you could use is to store the format file data (in either the non-XML or XML format) in a database table and extract it with a PowerShell script.
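To picture that extract step, here is a rough sketch of the same idea in Python with pyodbc rather than the PowerShell script the answer mentions; the dbo.FormatFiles table and its columns are hypothetical, and the path written to must be readable by the SQL Server service account.

import os
import tempfile
import pyodbc

# Hypothetical storage: dbo.FormatFiles(name nvarchar(128), format_text nvarchar(max)).
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cur = conn.cursor()

# 1. Extract the stored format-file text from the database.
cur.execute("SELECT format_text FROM dbo.FormatFiles WHERE name = ?", "importformat")
format_text = cur.fetchone()[0]

# 2. Write it to a file the database engine can reach (a local temp path
#    only works when this script runs on the database server itself).
fmt_path = os.path.join(tempfile.gettempdir(), "importformat.fmt")
with open(fmt_path, "w") as f:
    f.write(format_text)

# 3. Reference the extracted file in the OPENROWSET bulk load.
cur.execute(
    "SELECT * FROM OPENROWSET("
    " BULK 'd:\\path\\some_variable_file.txt',"
    " FIRSTROW = 2,"
    " FORMATFILE = '" + fmt_path.replace("'", "''") + "'"
    ") AS import_rows"
)
rows = cur.fetchall()
conn.close()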