Do I understand correctly that HDF5-files should be manually closed like this:
import h5py
file = h5py.File('test.h5', 'r')
...
file.close()
From the documentation: "HDF5 files work generally like standard Python file objects. They support standard modes like r/w/a, and should be closed when they are no longer in use.".
But I wonder: will the garbage collection evoke file.close() when the script terminates or when file is overwritten?
This was answered in the comments a long time ago by #kcw78, but I thought I might as well write it up as a quick answer for anyone else reaching this.
As #kcw78 says, you should explicitly close files when you are done with them by calling file.close(). From previous experience, I can tell you that h5py files are usually closed properly anyway when the script terminates, but occasionally the files would be corrupt (although I'm not sure if that ever happens when in 'r' mode only). Better not to leave it to chance!
As #kcw78 also suggests, using a context manager is a good way to go if you want to be safe. In either case, you need to be careful to actually extract the data you want before letting the file close.
e.g.
import h5py
with h5py.File('test.h5', 'w') as f:
f['data'] = [1,2,3]
# Letting the file close and reopening in read only mode for example purposes
with h5py.File('test.h5', 'r') as f:
dataset = f.get('data') # get the h5py.Dataset
data = dataset[:] # Copy the array into memory
print(dataset.shape, data.shape) # appear to behave the same
print(dataset[0], data[0]) # appear to behave the same
print(data[0], data.shape) # Works same as above
print(dataset[0], dataset.shape) # Raises ValueError: Not a dataset
dataset[0] raises an error here because dataset was an instance of h5py.Dataset which was associated with f and was closed at the same time f was closed. Whereas data is just a numpy array containing only the data part of the dataset (i.e. no additional attributes).
Related
I'm running a short code to open one by one a list of files and saving back only one of the variables contained in the files. The process seems to me much slower than I expected and getting slower with time, I don't fully understand why and how I could make it run faster. I always struggle with optimization. I'd appreciate if you have suggestions.
The code is the following (the ... substitute the actual path just for example):
main_dir=dir(strcat('\\storage2-...\Raw\DAQ5\'));
filename={};
for m=7:size(main_dir,1)
m
second_dir=dir([main_dir(m).folder '\' main_dir(m).name '\*.mat']);
for mm=1:numel(second_dir)
filename{end+1}=[second_dir(mm).folder '\' second_dir(mm).name];
for mmm=1:numel(filename)
namefile=sprintf(second_dir(mm,1).name);
load(string(filename(1,mmm)));
save(['\\storage2-...\DAQ5\Ch1_',namefile(end-18:end-4),'.mat'], 'Ch_1_y')
end
end
end
The original file is about 17 MB and once the single variable is saved it is about 6 MB in size.
The Matlab load function takes an optional additional argument to specify just a selected variable to read from the input file.
s = load('path/to/file.mat', 'Ch_1_y');
That way you don't have to spend time loading in all the other variables from those input .mat files that you're just going to immediately throw away.
And using save to save MAT-files over SMB shares can be slow. You might want to call save to write it to a temporary local file first, and then copy the completed file to the final destination. Sounds like more I/O, but it can actually be a net win, depending on your particular system and network. Measure it both ways to see if it's a win in your particular situation.
I got a mistake when I use abaqus subroutine to read file with multiple processors(cpus),could you help me to deal with this mistake.thanks a lot
I want to read variables from a file ,when one cpu is used,everything is ok,
but when more than one cpus are used,there will be a mistake,it seems that every cpu repeat the same command.
for example,the following is the contents of the file to read from,file name is data.dat
*matID ,2,1
131000.000, 8880.000, 8180.000
0.324, 0.324, 0.300
3990.000, 5320.000, 5320.000
1871.000, 59.700, 59.700
1291.000, 215.000, 215.000
90.000, 102.000, 102.000
my subroutine is shown as follow:
character*12 check1
integer check2,error
OPEN(10,file='data.dat',status='old',iostat=error)
if (error.EQ.0) then
read(10,*,iostat=error) check1,Nm
end if
close(10)
print *,'Nm=',nm,error
print *,'**'
when I use 2 cpus,the printed results will be :
Nm= 2 0
Nm= 8880 0
**
**
Depending on the reason for reading in data from a file, there are a couple of ways to avoid this problem:
If you only need to access the data once:
Read in the data in a subroutine that is always called in serial. UEXTERNALDB is a good example and can be used so that the file open only happens at the beginning of an analysis or the beginning of an increment as needed. You can then carefully store the information in common blocks. Reading from a common block in parallel should work fine, but do not write to them from the parallel subroutines.
Another way to get in a smaller amount of data is to define solution variables in your input file instead.
If you really need to open this file locally within each parallel thread (can't see why but open to correction), you can use GETNUMCPUS and GETRANK to open different copies of the files within each thread. GETRANK returns an integer giving you the rank/id of the process. I would advise against this method though. If your problem is large enough to warrant using parallel, then you should avoid slowing it down with file reads.
For more info see sections 1.1.31 and 2.1.4 of the Abaqus 6.14 docs.
I'm having problems with leaking file descriptors in code I have running in ipython notebook. I'm downloading lots of files with urllib2 and saving them locally. Apparently, urllib2 has a history of leaking file descriptors, which I suspect is causing problem. In the end, I get an IoError: Too many open files.
As a workaround, I periodically close a bunch of sockets using os.close. Unfortunately, ipython notebook has lots of sockets running which I don't want to close.
Is there a way that I can identify which file descriptors/sockets/etc.. belong to ipython?
This isn't really an answer, but a couple of workarounds in case others find themselves here with leaking file descriptor problems.
The first workaround, which is probably better, is to use subprocess.call() to download the files I want with wget. It's been about 4 times as fast as the method below.
The second workaround is to use a couple of handy functions I found on SO (which I can't find at the moment - if you find it, edit this or let me know and I'll link):
import resource
import fcntl
import os
def get_open_fds():
fds = []
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
for fd in range(0, soft):
try:
flags = fcntl.fcntl(fd, fcntl.F_GETFD)
except IOError:
continue
fds.append(fd)
return fds
def get_file_names_from_file_number(fds):
names = []
for fd in fds:
names.append(os.readlink('/proc/self/fd/%d' % fd))
return names
With these, I store the active file descriptors and corresponding names before I start downloading files. I then periodically test the number of open file descriptors, and if it's getting dangerously large, use os.close() on all the ones that aren't in the original list (I check the names too - descriptors themselves get recycled).
It's ugly, and occasionally ipython notebook complains with things like "can't save history" (presumably I've clobbered something it was using), but it is working pretty well otherwise.
I am new to MATLAB programming and some of the syntax escapes me. So I need a little help. Plus I need some complex looping ideas.
Here's the breakdown of what I have:
12 seperate .dat files, each titled something like output_1_x.dat, output_2_x.dat, etc.
each file is actually one piece of a whole that was seperated and processed
each .dat file is approx. 3.9 GB
Here's what I need to do:
create a single file containing all the data from each seperate file, i.e. I need to recreate the original file.
call this complete output file something like output_final.dat
it has to be done in MATLAB, there are no other alternatives (actually there maybe; see note below)
What is implied:
I will have to fread each 3.9 GBfile into chunks or packets, probably 100 mb at a time (using an imbedded loop?)
these packets will have to be read then written sequentially
after one file is read then written into output_final.dat, the next file is automatically read & written (the master loop).
Well, that's pretty much it. I did a search for 'merging mulitple files' and found this. That isn't exactly what I need to do...I don't need to take part of a file, or data from files, and write it to a new one. I'm simply...concatenating...? This would be simple in Java or Perl, but I only have MATLAB as a tool.
Note: I am however running KDE in OpenSUSE on a pretty powerful box. Maybe someone who is also an expert in terminal knows a command/script to do this from the kernel?
So on this site we usually would point you to whathaveyoutried.com but this question is well phrased.
I wont write the code but i will give you how I would do it. So first I am a bit confused about why you need to fread the file. Are you just appending one file onto the end of another?
You can actually use unix commands to achieve what you want:
files = dir('*.dat');
for i = 1:length(files)
string = sprintf('cat %s >> output_final.dat.temp', files(i).name);
unix(string);
end
That code should loop through all the files and pipe all of the content into output_final.dat.temp (then just rename it, we didn't want it to be included in anything);
But if you really want to use fread because you want to parse the lines in some manner then you can use the same process:
files = dir('*.dat');
fidF = fopen('output_final.dat', 'w');
for i = 1:length(files)
fid = fopen(files(i).name);
while(~feof(fid))
string = fgetl(fid) %You may choose to parse the string in some manner here
fprintf(fidF, '%s', string)
end
end
Just remember, if you are not parsing the lines this will take much much longer.
Hope this helps.
I suggest using a matlab.io.matfileclass objects on two of the files:
matObj1 = matfile('datafile1.mat')
matObj2 = matfile('datafile2.mat')
This does not load any data into memory. Then you can use the objects' methods to sequentialy save a variable from one file to another.
matObj1.varName = matObj2.varName
You can get all the variables in one file with fieldnames(mathObj1) and loop through to copy contents from one file to another. You can then clear some space by removing the copied fields. Or you can use a bit more risky procedure by directly moving the data:
matObj1.varName = rmfield(matObj2,'varName')
Just a disclaimer: haven't tried it, use at own risk.
We are currently reading the file line by line which delays to read and complete for all.
we would need to read the file fastly and prgoress with our commands.
the commands which i tried using fork and array just displays me the first set of lines only and not proceeding with pther sets.
please help on it.
Reading a large file takes a fair bit of time - disks are slow, after all. Before you start looking at Perl, first try (assuming you're on a unix-type system):
time cat /path/to/your/large/file >/dev/null
The output will tell you how long it takes to just read that file from disk without doing anything to it. Alternately, open the file in your favorite text editor and time how long it takes to load. Once you have that time, compare it to how long your Perl program takes to read the file. Unless the Perl program takes significantly longer, you're not likely to be able to do anything about it because the time is being spent on getting the data from disk rather than on processing it.
Of course, that's assuming that you actually do need to read the entire file. If you can get by with only reading specific parts of it, then you could create an index file and use that to jump directly to the part that's of interest, but you haven't provided enough information for us to tell whether that would apply to your case or not.
If you need more specific help, please provide a better description of what you mean to accomplish and a small, runnable piece of Perl code which shows how you're currently reading and processing the file so that we can see whether you're doing anything particularly inefficient that can be improved on.