Does h5py offer a neat simple way to create or use a group or dataset? - h5py

I'm fixing a python script using h5py. It contains code like this:
hdf = h5py.File(hdf5_filename, 'a')
...
g = hdf.create_group('foo')
g.create_dataset('bar', ...whatever...)
Sometimes this runs on a file which already has a group named 'foo', in which case I see "ValueError: Unable to create group (Name already exists)"
One way to fix this is to replace the one simple line with create_group with four lines, like this:
if 'foo' in hdf.keys():
g = hdf['foo']
else:
g = hdf.create_group['foo']
g.create_dataset(...etc...)
Is there a neater way to do this, maybe in only one line? Like how with files in the standard C library, 'a' mode will either append to an existing file, or create a file if it's not already there.
Same goes for datasets - I have
create_dataset('bar', ...)
but should check first:
if 'bar' in g.keys():
d = g['bar']
else:
d = g.create_dataset('bar')
My wish: to find h5py has methods named create_or_use_group() and create_or_use_dataset(). What actually exists?

Yes: require_group and require_dataset:
with h5py.File("/tmp/so_hdf5/test2.h5", 'w') as f:
a = f.create_dataset('a',data=np.random.random((10, 10)))
with h5py.File("/tmp/so_hdf5/test2.h5", 'r+') as f:
a = f.require_dataset('a', shape=(10, 10), dtype='float64')
d = f.require_dataset('d', shape=(20, 20), dtype='float64')
g = f.require_group('foo')
print(a)
print(d)
print(g)
Note that you do need to know the shape and dtype of the dataset, otherwise require_dataset throws a TypeError. In that case, you could do something like:
try:
a = f.require_dataset('a', shape=(10, 10), dtype='float64')
except TypeError:
a = f['a']
If you don't already know the shape and dtype, I don't think there's much advantage to require_dataset over using try ... except ...

Related

Matlab: read multiple files

my matlab script read several wav files contained in a folder.
Each read signal is saved in cell "mat" and each signal is saved in array. For example,
I have 3 wav files, I read these files and these signals are saved in arrays "a,b and c".
I want apply another function that has as input each signal (a, b and c) and the name of corresponding
file.
dirMask = '\myfolder\*.wav';
fileRoot = fileparts(dirMask);
Files=dir(dirMask);
N = natsortfiles({Files.name});
C = cell(size(N));
D = cell(size(N));
for k = 1:numel(N)
str =fullfile(fileRoot, Files(k).name);
[C{k},D{k}] = audioread(str);
mat = [C(:)];
fs = [D(:)];
a=mat{1};
b=mat{2};
c=mat{3};
myfunction(a,Files(1).name);
myfunction(b,Files(2).name);
myfunction(c,Files(3).name);
end
My script doesn't work because myfunction considers only the last Wav file contained in the folder, although
arrays a, b and c cointain the three different signal.
If I read only one wav file, the script works well. What's wrong in the for loop?
Like Cris noticed, you have some issues with how you structured your for loop. You are trying to use 'b', and 'c' before they are even given any data (in the second and third times through the loop). Assuming that you have a reason for structuring your program the way you do (I would rewrite the loop so that you do not use 'a','b', or 'c'. And just send 'myfunction' the appropriate index of 'mat') The following should work:
dirMask = '\myfolder\*.wav';
fileRoot = fileparts(dirMask);
Files=dir(dirMask);
N = natsortfiles({Files.name});
C = cell(size(N));
D = cell(size(N));
a = {};
b = {};
c = {};
for k = 1:numel(N)
str =fullfile(fileRoot, Files(k).name);
[C{k},D{k}] = audioread(str);
mat = [C(:)];
fs = [D(:)];
a=mat{1};
b=mat{2};
c=mat{3};
end
myfunction(a,Files(1).name);
myfunction(b,Files(2).name);
myfunction(c,Files(3).name);
EDIT
I wanted to take a moment to clarify what I meant by saying that I would not use the a, b, or c variables. Please note that I could be missing something in what you were asking so I might be explaining things you already know.
In a certain scenarios like this it is possible to articulate exactly how many variables you will be using. In your case, you know that you have exactly 3 audio files that you are going to process. So, variables a, b, and c can come out. Great, but what if you have to throw another audio file in? Now you need to go back in and add a 'd' variable and another call to 'myfunction'. There is a better way, that not only reduces complexity but also extends functionality to the program. See the following code:
%same as your code
dirMask = '\myfolder\*.wav';
fileRoot = fileparts(dirMask);
Files = dir(dirMask);
%slight variable name change, k->idx, slightly more meaningful.
%also removed N, simplifying things a little.
for idx = 1:numel(Files)
%meaningful variable name change str -> filepath.
filepath = fullfile(fileRoot, Files(idx).name);
%It was unclear if you were actually using the Fs component returned
%from the 'audioread' call. I wanted to make sure that we kept access
%to that data. Note that we have removed 'mat' and 'fs'. We can hold
%all of that data inside one variable, 'audio', which simplifies the
%program.
[audio{idx}.('data'), audio{idx}.('rate')] = audioread(filepath);
%this function call sends exactly the same data that your version did
%but note that we have to unpack it a little by adding the .('data').
myfunction(audio{idx}.('data'), Files(idx).name);
end

Can operations on a numpy.memmap be deferred?

Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np
# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))
# I want to print the first value using f and memmaps
def f(value):
print(value[1])
# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)
# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np
class Defered(np.ndarray):
"""
An array class that deferrs calculations applied to it, only
calculating them when an index is requested
"""
def __new__(cls, arr):
arr = np.asanyarray(arr).view(cls)
arr.toApply = []
return arr
def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
## Convert all arguments to ndarray, otherwise arguments
# of type Defered will cause infinite recursion
# also store self as None, to be replaced later on
newinputs = []
for i in inputs:
if i is self:
newinputs.append(None)
elif isinstance(i, np.ndarray):
newinputs.append(i.view(np.ndarray))
else:
newinputs.append(i)
## Store function to apply and necessary arguments
self.toApply.append((ufunc, method, newinputs, kwargs))
return self
def __getitem__(self, idx):
## Get index and convert to regular array
sub = self.view(np.ndarray).__getitem__(idx)
## Apply stored actions
for ufunc, method, inputs, kwargs in self.toApply:
inputs = [i if i is not None else sub for i in inputs]
sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)
return sub
This will fail if modifications are made to it that don't use numpy's universal functions. For instance percentile and median aren't based on ufuncs, and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or applies an index to substantial amounts the entire array will be loaded.
This is just how python works. By default numpy operations return a new array, so b never exists as a memmap - it is created when + is called on a.
There's a couple of ways to work around this. The simplest is to do all operations in place,
a += 1
This requires loading the memory mapped array for reading and writing,
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped.
b = np.memmap("b.npy", mmap+mode='w+', dtype=a.dtype, shape=a.shape)
Assigning can be done by using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)

How to replace/modify something in a call to function 1 from within function 2 (both in their separate files)

The given task is to call a function from within another function, where both functions are handling matrices.
Now lets call this function 1 which is in its own file:
A = (1/dot(v,v))*(Ps'*Ps);
Function 1 is called with the command:
bpt = matok(P);
Now in another file in the same folder where function 1 is located (matok.m) we make another file containing function 2 that calls function 1:
bpt = matok(P);
What I wish B to do technically, is to return the result of the following (where D is a diagonal matrix):
IGNORE THIS LINE: B = (1/dot(v,v))*(Ps'*inv(D)*Ps*inv(D);
EDIT: this is the correct B = (1/dot(v,v))*(Ps*inv(D))'*Ps*inv(D);
But B should not "re-code" what has allready been written in function 1, the challenge/task is to call function 1 within function 2, and within function 2 we use the output of function 1 to end up with the result that B gives us. Also cause in the matrix world, AB is not equal to BA, then I can't simply multiply with inv(D) twice in the end. Now since Im not allowed to write B as is shown above, I was thinking of replacing (without altering function 1, doing the manipulation within function 2):
(Ps'*Ps)
with
(Ps'*inv(D)*Ps*inv(D)
which in some way I imagine should be possible, but since Im new to Matlab have no idea how to do or where even to start. Any ideas on how to achieve the desired result?
A small detail I missed:
The transpose shouldn't be of Ps in this:
B = (1/dot(v,v))*(Ps'*inv(D))*Ps*inv(D);
But rather the transpose of Ps and inv(D):
B = (1/dot(v,v))*(Ps*inv(D))'*Ps*inv(D);
I found this solution, but it might not be as compressed as it could've been and it seems a bit unelegant in my eyes, maybe there is an even shorter way?:
C = pinv(Ps') * A
E = (Ps*inv(D))' * C
Since (A*B)' = B'*A', you probably just need to call
matok(inv(D) * Ps)

use macro to extract several object fields in Julia

I have a structure, from which I want to access repeatedly the fields to I load them in the current space like this (where M is type with fields X and Y):
X = M.X
Y = M.Y
in R, I often use the with command to do that. For now I would just like to be able to have a macro that expends that code, something along the lines of
#attach(M,[:X,:Y])
I am just not sure how exactly to do this.
I've included in this answer a macro that does pretty much what you describe. Comments explaining what's going on are inline.
macro attach(struct, fields...)
# we want to build up a block of expressions.
block = Expr(:block)
for f in fields
# each expression in the block consists of
# the fieldname = struct.fieldname
e = :($f = $struct.$f)
# add this new expression to our block
push!(block.args, e)
end
# now escape the evaled block so that the
# new variable declarations get declared in the surrounding scope.
return esc(:($block))
end
You use it like this: #attach M, X, Y
you can see the generated code like so: macroexpand(:(#attach M, X, Y)) which will show something like this:
quote
X = M.X
Y = M.Y
end

Equivalent of Fieldnames function for cells

As the title says, just wondering if there is a function that works as fieldnames (http://www.mathworks.co.uk/help/matlab/ref/fieldnames.html) does, but for cells.
So if I have something like:
a = imread('redsquare.bmp');
b = imread('bluesquare.bmp');
c = imread('yellowsquare.bmp');
d = imread('greysquare.bmp');
e = {a, b, c, d};
I'm trying to retrieve either: a, b, c, d OR the image name without the extension.
I have tried fn = fileparts(e) and fntemp = cell2struct(e,2), but I can't get it working.
Hope this makes sense
Thanks
The cell array does not include any meta-information, like field name or file name. If you want access to that information you'll need to change your data storage structure. Some options include:
Scalar Structure
Good for when there is a single name to reference.
images.red = imread('redsquare.bmp');
images.blue = imread('bluesquare.bmp');
Use fieldnames(images) to get the names.
Array of structures
A little bit more general. Allows completely general names (including special characters and spaces) and additional metadata if you need it (like "size", "author")
images(1).name = 'red';
images(1).im = imread('redsquare.bmp');
images(2).name = 'blue';
images(3).im = imread('bluesquare.bmp');
Use {fieldnames.name} to get just the names.
Containers.map
Probably more than you need here, but good to know about. help comtainers.map for more.