Stata - efficiently appending 200+ files (my method takes hours)

I am trying to append approximately 200 files using Stata. Below I have provided the code I am using to append. The issue is that it takes too long -- over 5 hours. The final appended file has over 28 million observations and is about 2GB in size. I think the problem is that the dataset is re-saved on every iteration, which gets more expensive as the file grows. I also tried using tempfiles, but that takes just as long. My colleague, on the other hand, did the same append in minutes using SAS; I have provided his code below as well. I would very much appreciate it if someone could show me how to do this efficiently in Stata, so that it does not take hours. Thanks much!
My Stata code:
file close _all
file open myfile using "$OP\filelist_test.txt", read

* Read the first file name and create the initial dataset
file read myfile line
cd "$OP"
insheet using "`line'", comma clear
tostring optionconditioncode, replace
save "$data\options_all", replace

* Append each remaining file, re-saving the growing dataset every time
file read myfile line
while r(eof)==0 {
    insheet using "`line'", comma clear
    tostring optionconditioncode, replace
    append using "$data\options_all"
    save "$data\options_all", replace
    file read myfile line
}
file close myfile
My colleague's SAS code:
data all_text (drop=fname);
    length myfilename $100;
    set dirlist;
    filepath = "&dirname\"||fname;
    infile dummy filevar = filepath length=reclen end=done missover dlm=',' firstobs=2 dsd;
    do while(not done);
        myfilename = filepath;
        input var1 var2 var3 var4;
        output;
    end;
run;

It seems the OP has not been around lately. The solution given by Robert Picard in the Stata forum link that the OP provided is as follows:
> Take a look at -filelist- from SSC. It can create a Stata dataset of
> files (with full path). The help file has an example that does what
> you want efficiently. Here's a copy:
>
> use "csv_datasets.dta", clear
> local obs = _N
> forvalues i=1/`obs' {
>     use "csv_datasets.dta" in `i', clear
>     local f = dirname + "/" + filename
>     insheet using "`f'", clear
>     tempfile save`i'
>     save "`save`i''"
> }
>
> use "`save1'", clear
> forvalues i=2/`obs' {
>     append using "`save`i''"
> }


determine if the file is empty and separate the names into different files

The goal of my code is to look into a certain folder and write the names of all the non-empty files in that folder to one text file, and the names of all the empty files (no text) to another file. My current code is only able to create a single text file listing the names of all the files, regardless of their content. I want to know how to set up an if statement based on the content of each file.
function ListFile
    dirName = '';                                % current folder
    files = dir(fullfile(dirName,'*.txt'));      % list all .txt files
    files = {files.name};
    [fid,msg] = fopen(sprintf('output.txt'),'w+t');
    assert(fid>=0,msg)
    fprintf(fid,'%s\n',files{:});                % write every name, one per line
    fclose(fid);
EDIT: The linked solution in Stewie Griffin's comment is way better. Use this!
A simple approach would be to iterate over all files, open them, and check their content. Caveat: if you have large files, this approach might be memory-intensive.
A possible code for that could look like this:
function ListFile
    dirName = '';
    files = dir(fullfile(dirName, '*.txt'));
    files = {files.name};
    fidEmpty = fopen(sprintf('output_empty_files.txt'), 'w+t');
    fidNonempty = fopen(sprintf('output_nonempty_files.txt'), 'w+t');
    for iFile = 1:numel(files)
        content = fileread(files{iFile})   % no semicolon: echoes the content for debugging
        if (isempty(content))
            fprintf(fidEmpty, '%s\n', files{iFile});
        else
            fprintf(fidNonempty, '%s\n', files{iFile});
        end
    end
    fclose(fidEmpty);
    fclose(fidNonempty);
I have two non-empty files nonempty1.txt and nonempty2.txt as well as two empty files empty1.txt and empty2.txt. Running this code, I get the following outputs.
Debugging output from fileread:
content =
content =
content = Test
content = Another test
Content of output_empty_files.txt:
empty1.txt
empty2.txt
Content of output_nonempty_files.txt:
nonempty1.txt
nonempty2.txt
Matlab isn't really the optimal tool for this task (although it is capable). To generate the files you're looking for, a command line tool would be much more efficient.
For example, using GNU find you could do
find . -type f -not -empty -ls > notemptyfiles.txt
find . -type f -empty -ls > emptyfiles.txt
to create the text files you desire. Here's a link for doing something comparable from the Windows command line. You could also call these commands from within Matlab via the system function, which would be much faster than iterating over the files in Matlab.
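For example, a minimal sketch of the system route (this assumes a Linux/macOS machine with GNU find on the path; the output file names are only examples):
% Run GNU find via the shell and write the lists straight to text files
[s1, out1] = system('find . -type f -not -empty -ls > notemptyfiles.txt');
[s2, out2] = system('find . -type f -empty -ls > emptyfiles.txt');
% A non-zero status means the shell command failed; out1/out2 hold its output
assert(s1 == 0 && s2 == 0, 'find failed: %s %s', out1, out2);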

Reading huge .csv files with matlab - file is not well organized

I have several .csv files that I read with matlab using textscan, because csvread and xlsread do not support files of this size (200MB-600MB).
I use this line to read it:
C = textscan(fileID,'%s%d%s%f%f%d%d%d%d%d%d%d','delimiter',',');
The problem I have found is that sometimes the data is not in this format; textscan then stops reading at that line, without any error.
So what I have done is to read it this way:
C = textscan(fileID,'%s%d%s%f%f%s%s%s%s%s%s%s%s%s%s%s','delimiter',',');
This way I can see that in 2 rows out of 3 million there is a change in the format.
I want to read all the lines except the bad/different lines.
In addition, is it possible to read only the lines where the first string is 'PAA'?
I have tried to load the file directly into matlab, but it is super slow and sometimes gets stuck, and for the really big files it runs into memory problems.
Any recommendations?
For large files which are still small enough to fit in memory, parsing all lines at once is typically the best choice.
f = fopen('data.txt');
g = textscan(f,'%s','delimiter','\n');   % one cell entry per line of the file
fclose(f);
As a next step you have to identify the lines starting with PAA, using strncmp.
Now, having your data filtered, apply your textscan expression from above to each line. If it fails, try the other one.
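Putting that together, a rough sketch of the whole pipeline (the check on the last cell being empty is a heuristic for a partial parse; fmt1 and fmt2 simply restate the two formats from the question):
lines = g{1};                                 % one char row vector per line
lines = lines(strncmp(lines, 'PAA', 3));      % keep only lines starting with PAA
fmt1 = '%s%d%s%f%f%d%d%d%d%d%d%d';            % expected format
fmt2 = '%s%d%s%f%f%s%s%s%s%s%s%s%s%s%s%s';    % all-string fallback
for k = 1:numel(lines)
    row = textscan(lines{k}, fmt1, 'delimiter', ',');
    if isempty(row{end})                      % parse stopped early, so try the fallback
        row = textscan(lines{k}, fmt2, 'delimiter', ',');
    end
    % ... collect row into your result arrays here ...
end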
Matlab is slow with this kind of thing because it needs to load everything into memory. I would suggest using grep/bash/cmd-line tools to reduce the file to readable lines before processing it in Matlab. On Linux you can do:
awk -F ',' '$1 ~ /^PAA/' yourfile.csv > yourNewFile.csv
This gives you a new file with all the lines whose first field starts with PAA (note: case-sensitive).
To find lines that do not have the expected format, you can use:
awk -F ',' 'NF != 12 {print NR, $0}' yourfile.csv > badlines.txt
This prints, with its line number, every line that does not have exactly 12 comma-separated fields. Inverting the test to NF == 12 (and dropping the NR) instead writes a cleaned file containing only the well-formed lines.

Abaqus *.inp file created using Matlab

I was trying to do parametric studies in ABAQUS. I created a master *.inp file using the GUI in abaqus, then wrote a matlab code to create new *.inp files from the master. The master *.inp file can be found here and is required to run the code.
In the new *.inp file everything is the same as in the master except for a few specific lines which I am changing for the parametric studies; the code is given below. The files are generated nicely, but the problem is that ABAQUS can't read them and gives error messages. By visual inspection I can't find any faults. I guess matlab is writing the *.inp file in some format which ABAQUS can't interpret.
clc;
%Number of lines to be copied
total_lines=4538; %total number of lines
lines_b4_RP1=4406; % lines before reference point 1
%creating new files
for A=0
    for R=[20 30 40 50 100 200 300 400 500]
        fileroot = sprintf('P_SHS_120X120X1_NLA_I15_A%dR%d.inp', A,R);
        main_inp=fopen('P_SHS_120X120X1_NLA_I15_A0R10.inp','r'); %inputting the main inp file to be copied
        wfile=fopen(fileroot,'w+'); %wfile= writing the new file
        for i=1:total_lines
            data=fgets(main_inp);
            if i<lines_b4_RP1
                fprintf(wfile,'%s\n', data);
            elseif i==lines_b4_RP1
                formatline1=('%s\n');
                txtline='*Node';
                fprintf(wfile, formatline1, txtline);
            elseif i==(lines_b4_RP1+1)
                formatline2=('%d%s%d%s%d%s%d\r\n');
                comma=',';
                refpt1=1;
                xcoord1=R*cosd(A);
                ycoord1=R*sind(A);
                zcoord1=-20;
                fprintf(wfile, formatline2, refpt1,comma,xcoord1,comma,ycoord1,comma,zcoord1);
            elseif i==(lines_b4_RP1+2)
                fprintf(wfile, formatline1, txtline);
            elseif i==(lines_b4_RP1+3)
                refpt2=2;
                xcoord2=R*cosd(A);
                ycoord2=R*sind(A);
                zcoord2=420;
                fprintf(wfile, formatline2, refpt2,comma,xcoord2,comma,ycoord2,comma,zcoord2);
            elseif i>(lines_b4_RP1+3)
                fprintf(wfile,'%s\n', data);
            else
                break;
            end
        end
        fclose(main_inp);
        fclose(wfile);
    end
end
Thanks in advance.
N.B. A sample *.dat file containing the error message is given here.
You are using fgets to get each line of the input file. From the matlab help:
fgets: Read line from file, keeping newline characters
You then print each line using
fprintf(wfile,'%s\n', data);
This creates two newlines at the end of each data line in the file. A second problem in your file is that you use \r\n in your format specifier. In matlab (unlike C) this gives you two line breaks, e.g.
>> fprintf('Hello\rWorld\nFoo\r\nBar\n')
Hello
World
Foo
Bar
>>
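A minimal sketch of a fix, keeping the rest of the loop from the question unchanged: read with fgetl, which drops the newline that fgets keeps, and use a single \n in the coordinate format:
data = fgetl(main_inp);              % fgetl strips the trailing newline that fgets keeps
fprintf(wfile, '%s\n', data);        % so this now writes exactly one line break per line
formatline2 = ('%d%s%d%s%d%s%d\n');  % \n instead of \r\n avoids the doubled break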
I would suggest testing this approach on a much simpler file first in future. Also there is a
*preprint
option that allows you to echo the contents of the input file back into the dat file. This creates big dat files, but it is useful for debugging.
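For reference, the keyword line sits near the top of the input file and, as a sketch (check the keyword reference for your Abaqus version), looks something like:
*Preprint, echo=YES, model=NO, history=NO, contact=NO
With echo=YES the contents of the input file are echoed back into the .dat file, which makes it easy to see exactly what Abaqus received.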

Output from large .csv files generated by Matlab

I am downloading some data in static and time-series format using Matlab codes (and toolboxes). By default the codes write the results to a .csv file, separately for the static and the time-series data. Although the static data turns out fine, the time-series data is huge and the .csv file doesn't load all of it. I tried changing the output file extension to .dta and then to .mat in order to view the output in Stata or Matlab, and also tried writing a little loop to split the data from the large .csv file into two worksheets within the same file, but none of it has worked. Although I am used to some basic coding in Matlab, I am new to dealing with such large datasets. Any help on this would be very much appreciated.
Thank you- Veronica
How about using these two bits of VBScript to split your CSV file into two pieces? There will be 599,999 lines in the first output file (part1.csv) and the rest in the second file (part2.csv).
Save this as part1.vbs
Set fso = CreateObject("Scripting.FileSystemObject")
Set stdout = fso.GetStandardStream(1)
LineNum = 1
Do While Not WScript.StdIn.AtEndOfStream
    REM Read in the next line of input
    Line = WScript.StdIn.ReadLine()
    If LineNum < 600000 Then
        stdout.WriteLine(Line)
    End If
    LineNum = LineNum + 1
Loop
Save this as part2.vbs
Set fso = CreateObject("Scripting.FileSystemObject")
Set stdout = fso.GetStandardStream(1)
LineNum = 1
Do While Not WScript.StdIn.AtEndOfStream
    REM Read in the next line of input
    Line = WScript.StdIn.ReadLine()
    If LineNum >= 600000 Then
        stdout.WriteLine(Line)
    End If
    LineNum = LineNum + 1
Loop
Then you can do this at the Command Prompt to split your file in two (the scripts read from standard input, so the input file must be redirected in with <):
cscript /nologo part1.vbs < YourFile.CSV > part1.csv
cscript /nologo part2.vbs < YourFile.CSV > part2.csv
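If you would rather stay in Matlab, here is a rough equivalent as a sketch (the file names are only examples; the file is streamed line by line, so it is never held in memory in full):
fin   = fopen('YourFile.CSV', 'r');
fout1 = fopen('part1.csv', 'w');
fout2 = fopen('part2.csv', 'w');
lineNum = 1;
line = fgetl(fin);                     % fgetl returns -1 (not char) at end of file
while ischar(line)
    if lineNum < 600000                % same split point as the VBScript above
        fprintf(fout1, '%s\n', line);
    else
        fprintf(fout2, '%s\n', line);
    end
    lineNum = lineNum + 1;
    line = fgetl(fin);
end
fclose(fin); fclose(fout1); fclose(fout2);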

Matlab publish - Want to use a custom file name to publish several pdf files

I have several data log files (here: 34) for which I have to calculate certain values. I wrote a separate function to publish the results of the calculation to a pdf file, but I can only publish one file after another, so it takes a while to publish all 34 files.
Now I want to automate that with a loop: importing the data, calculating the values and publishing the results for every log file into a new pdf file. At the end I want 34 pdf files, one for every log file.
My problem is that I couldn't find a way to rename the pdf files during publishing. The pdf file is always named after the script which calculates the values, so the pdf is overwritten on each pass through the loop. At the end everything is calculated, but I only have the pdf from the last calculated log file.
There was this hacky solution, which changes the Matlab publish script itself, but since I don't have admin rights I can't use it:
"This is really hacky, but I would modify publish to accept a new option prefix. Replace line 93
[scriptDir,prefix] = fileparts(fullPathToScript);
with
if ~isfield(options, 'prefix')
    [scriptDir,prefix] = fileparts(fullPathToScript);
else
    [scriptDir,~] = fileparts(fullPathToScript);
    prefix = options.prefix;
end
Now you can set options.prefix to whatever filename you want. If you want to be really hardcore, make the appropriate modifications to supplyDefaultOptions and checkOptionFields as well."
Any suggestions?
Thanks in advance,
Martin
Here's one idea using movefile to rename the resultant published PDF on each iteration:
for i = 1:34
    file = publish(files{i}); % Replace with your own command(s); publish wants a char, hence the {} indexing
    [pathStr,fileName,ext] = fileparts(file);
    newFile = [pathStr filesep() fileName '_' int2str(i) ext]; % Example: append _# to each
    [success,msg,msgid] = movefile(file,newFile);
    if ~success
        error(msgid,msg);
    end
end
Also used are fileparts and filesep. See this question for other ways to rename and move files.
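Alternatively, publish accepts an options struct with an outputDir field, so you could sidestep the renaming entirely by sending each run to its own folder. A sketch (the script and folder names here are made up):
for i = 1:34
    opts = struct('format', 'pdf', 'outputDir', sprintf('report_%02d', i));
    publish('myCalculationScript.m', opts);   % the PDF keeps the script's name but lands in its own folder
end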