I'm running a short code to open one by one a list of files and saving back only one of the variables contained in the files. The process seems to me much slower than I expected and getting slower with time, I don't fully understand why and how I could make it run faster. I always struggle with optimization. I'd appreciate if you have suggestions.
The code is the following (the ... substitute the actual path just for example):
main_dir=dir(strcat('\\storage2-...\Raw\DAQ5\'));
filename={};
for m=7:size(main_dir,1)
m
second_dir=dir([main_dir(m).folder '\' main_dir(m).name '\*.mat']);
for mm=1:numel(second_dir)
filename{end+1}=[second_dir(mm).folder '\' second_dir(mm).name];
for mmm=1:numel(filename)
namefile=sprintf(second_dir(mm,1).name);
load(string(filename(1,mmm)));
save(['\\storage2-...\DAQ5\Ch1_',namefile(end-18:end-4),'.mat'], 'Ch_1_y')
end
end
end
The original file is about 17 MB and once the single variable is saved it is about 6 MB in size.
The Matlab load function takes an optional additional argument to specify just a selected variable to read from the input file.
s = load('path/to/file.mat', 'Ch_1_y');
That way you don't have to spend time loading in all the other variables from those input .mat files that you're just going to immediately throw away.
And using save to save MAT-files over SMB shares can be slow. You might want to call save to write it to a temporary local file first, and then copy the completed file to the final destination. Sounds like more I/O, but it can actually be a net win, depending on your particular system and network. Measure it both ways to see if it's a win in your particular situation.
Related
Is there any way to create an empty .mat file from a terminal session? Basically, what I am doing is brain graph analysis. The software I am using, if an entire brain is scrubbed (ie, if the displacement of the brain is greater than a certain threshold) the output file will be left out or will be very small. When analyzing, however, I need to be able to eliminate both subjects from the analysis if the entire brain is scrubbed/too much of the brain is scrubbed. To accomplish this, the easiest way would be to simply check the dimensions of the output file within matlab, and if they are below the arbitrary threshold I decide then both subjects will just be skipped over for analysis. The issue is, I can easily check if a file contains too few remaining frames, however, if the resulting file contains no frames, it will entirely just not exist. As the outputs are all sorted, the only thing I need to do is check consecutive files' dimensions, and if one of the files does not contain enough values, then I can simply skip over it entirely. Simply touching a blank file obviously will not work, since it will not contain any encoding. I hope this is a good explanation for my motivation to do this, and if any of you know of any suggestions, please let me know.
A simple solution would be to create an empty file from Matlab and duplicate the file when needed from the console.
Just open Matlab, set to the destination folder and type this:
clear all
save empty.mat
Then, when needed, copy the file from the console. :)
Saving the contents of an empty struct creates an empty .mat file:
emptyStruct = struct;
save('myFile.mat','-struct','emptyStruct');
I got a mistake when I use abaqus subroutine to read file with multiple processors(cpus),could you help me to deal with this mistake.thanks a lot
I want to read variables from a file ,when one cpu is used,everything is ok,
but when more than one cpus are used,there will be a mistake,it seems that every cpu repeat the same command.
for example,the following is the contents of the file to read from,file name is data.dat
*matID ,2,1
131000.000, 8880.000, 8180.000
0.324, 0.324, 0.300
3990.000, 5320.000, 5320.000
1871.000, 59.700, 59.700
1291.000, 215.000, 215.000
90.000, 102.000, 102.000
my subroutine is shown as follow:
character*12 check1
integer check2,error
OPEN(10,file='data.dat',status='old',iostat=error)
if (error.EQ.0) then
read(10,*,iostat=error) check1,Nm
end if
close(10)
print *,'Nm=',nm,error
print *,'**'
when I use 2 cpus,the printed results will be :
Nm= 2 0
Nm= 8880 0
**
**
Depending on the reason for reading in data from a file, there are a couple of ways to avoid this problem:
If you only need to access the data once:
Read in the data in a subroutine that is always called in serial. UEXTERNALDB is a good example and can be used so that the file open only happens at the beginning of an analysis or the beginning of an increment as needed. You can then carefully store the information in common blocks. Reading from a common block in parallel should work fine, but do not write to them from the parallel subroutines.
Another way to get in a smaller amount of data is to define solution variables in your input file instead.
If you really need to open this file locally within each parallel thread (can't see why but open to correction), you can use GETNUMCPUS and GETRANK to open different copies of the files within each thread. GETRANK returns an integer giving you the rank/id of the process. I would advise against this method though. If your problem is large enough to warrant using parallel, then you should avoid slowing it down with file reads.
For more info see sections 1.1.31 and 2.1.4 of the Abaqus 6.14 docs.
I am using exist(x, 'file') to check for the existence of a file on my machine. The execution of this command takes FOREVER (over 10 seconds per call!).
My matlabpath is not too long (about 200 entries) and all folders on path are on my local drive (no network).
Why does exist takes forever?
Is there a way to make it run FASTER?
PS,
This call to exist is part of Matlab's execution of loadlibrary. So, if you are calling loadlibrary and you don't know why it takes forever - this question is also for you.
Here's one idea. You could put the directory containing those header files up at the front of the MATLAB path, so when exist() goes looking through the path, it finds them quickly and doesn't have to search through the rest of the entries. If it's spending its time stepping through your path, that may help.
Wow! That was a tough one. Bottom line: Delete %TEMP% files!
I had a few thousands files lying around in %TEMP%. It appears MATLAB really likes to go over and over the TEMP directory.
After clearing the TEMP folder, exist runs in no time!
(Thanks Andrew for the Process Monitor advice!)
exist is a built in Matlab function. It is designed to check existence of other types of objects (such as variables in Matlab) as well as files. Being a built in function, it's not a simple to see how it is coded. At least on Windows, when you call exist('filename','file') it seemingly only makes one API call to the operating system to check the file existence. So either the operating system is taking a long time, or there is some bloat in the exist function making it run slowly. See the solutions from the other posters for ideas on how to make the operating system return its result more quickly
People sometimes complain that running exist('filename','file') in a loop makes the loop very slow, this is due the call taking perhaps milliseconds and looping over a few thousand times. The solution here is to replace
if exist('filename','file')
% your code
with the line
if java.io.File('filename').exists
% your code
For 372 files
Matlab: Elapsed time is 40.207266 seconds. (get a cup of thee)
Java: Elapsed time is 0.122165 seconds. (eye blinking)
I am new to MATLAB programming and some of the syntax escapes me. So I need a little help. Plus I need some complex looping ideas.
Here's the breakdown of what I have:
12 seperate .dat files, each titled something like output_1_x.dat, output_2_x.dat, etc.
each file is actually one piece of a whole that was seperated and processed
each .dat file is approx. 3.9 GB
Here's what I need to do:
create a single file containing all the data from each seperate file, i.e. I need to recreate the original file.
call this complete output file something like output_final.dat
it has to be done in MATLAB, there are no other alternatives (actually there maybe; see note below)
What is implied:
I will have to fread each 3.9 GBfile into chunks or packets, probably 100 mb at a time (using an imbedded loop?)
these packets will have to be read then written sequentially
after one file is read then written into output_final.dat, the next file is automatically read & written (the master loop).
Well, that's pretty much it. I did a search for 'merging mulitple files' and found this. That isn't exactly what I need to do...I don't need to take part of a file, or data from files, and write it to a new one. I'm simply...concatenating...? This would be simple in Java or Perl, but I only have MATLAB as a tool.
Note: I am however running KDE in OpenSUSE on a pretty powerful box. Maybe someone who is also an expert in terminal knows a command/script to do this from the kernel?
So on this site we usually would point you to whathaveyoutried.com but this question is well phrased.
I wont write the code but i will give you how I would do it. So first I am a bit confused about why you need to fread the file. Are you just appending one file onto the end of another?
You can actually use unix commands to achieve what you want:
files = dir('*.dat');
for i = 1:length(files)
string = sprintf('cat %s >> output_final.dat.temp', files(i).name);
unix(string);
end
That code should loop through all the files and pipe all of the content into output_final.dat.temp (then just rename it, we didn't want it to be included in anything);
But if you really want to use fread because you want to parse the lines in some manner then you can use the same process:
files = dir('*.dat');
fidF = fopen('output_final.dat', 'w');
for i = 1:length(files)
fid = fopen(files(i).name);
while(~feof(fid))
string = fgetl(fid) %You may choose to parse the string in some manner here
fprintf(fidF, '%s', string)
end
end
Just remember, if you are not parsing the lines this will take much much longer.
Hope this helps.
I suggest using a matlab.io.matfileclass objects on two of the files:
matObj1 = matfile('datafile1.mat')
matObj2 = matfile('datafile2.mat')
This does not load any data into memory. Then you can use the objects' methods to sequentialy save a variable from one file to another.
matObj1.varName = matObj2.varName
You can get all the variables in one file with fieldnames(mathObj1) and loop through to copy contents from one file to another. You can then clear some space by removing the copied fields. Or you can use a bit more risky procedure by directly moving the data:
matObj1.varName = rmfield(matObj2,'varName')
Just a disclaimer: haven't tried it, use at own risk.
I have implemented a log file that will be storing the cpu and memory state of a process after every minute.I have limited the maximum size of the file to 3MB (thats enough for my purpose).
The script will be called by a cron job after every minute and the script will log the details for that minute and will rename the file as "Log_.log".
When the size reaches "3MB - 100 bytes" I reset the file pointer to point to the begining and will overwrite the first entry in the log file and will now rename the file as "Log_<0+some offset>.log".
As I am renaming the file after every minute to update the file pointer position, is it a good/efficient way ?
I do not want to maintain more than one log file for this purpose.
Another option for me is to maintain the file pointer position in a file ,but ....another file !! not interested in maintaining one if this option is good :)
Thanks in Advance.
Are you an engineer? This is a nice example of some simple task, solved by a perfectly working but overly complex solution.
Unless the content you put in takes exactly as many bytes as the content you take out, writing "in" a file will actually cause the whole following part after your writing position to be rewritten to disk. Append is much cheaper.
Renaming the file to store the pointer works - but it's not very elegant, and makes stuff more complex (for one, your process needs write rights to the directory in which the file resides - else just write access to two files is sufficient)
Unless disk space is an issue (and really, it rarely is), your approach is less efficient than say, append everything to a file, and rotate the file when it reaches its maximum size. This way you always have the last 3MB of logs available, and maximum 3MB more in your current file. It will make parsing the file a lot easier too, instead of recalculating the entire pointer position thing.
Update to answer your comment:
Renaming a file every minute (or even every second) shouldn't slow down your system significantly, don't worry about that.
Our concerns are mainly with "why you think you need to rename the file". It's not better technically, it's not better from a logical point of view, it makes a lot of other (future) tasks harder. You could store the file pointer in a seperate file, or at the end of your file, and there are better^H^H^H^H^H^H simpler solutions that don't require the file pointer at all.
I'm confused why you would rename your file. What does this accomplish?
Are the log entries fixed size? Or variable size?
If the entries are fixed size, then there is no trouble in re-writing the existing file from the start: you won't ever have incomplete entries in your file, and if you are writing a counter or timestamps to the file, it should be clear where the 'cursor' is located.
If the entries are variable size, then you should probably not begin re-writing the file from the beginning without somehow making it clear where the 'cursor' is located in the file, and write code that is resilient to reading truncated log entries.
Can you re-use existing tools such as RRDtool?