Follow soft links when reading a file in h5py

I have an HDF5 file containing a big dataset, an Nx3 matrix that stores positions in 3D. This dataset is referenced in several groups using soft links, as shown in the hierarchy below:
/
/POINTS (the big dataset)
/mesh0
/mesh0/POINTS (softlink to /POINTS)
/mesh1
/mesh1/POINTS (softlink to /POINTS)
However, to load this using h5py, I iterate over my groups, and if I find a mesh (a group with an attribute called mesh), I assume it has a POINTS dataset and parse it. The issue is that this creates a new NumPy array for each POINTS dataset.
# This creates a new NumPy array, which is inefficient if we are dealing with soft links
points = mesh_group["POINTS"][::]
I would like to know how to check whether the link to the dataset is a soft link, so I can create the matrix only once.
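One possible approach, as a minimal sketch: h5py's Group.get accepts getlink=True, which returns the link object itself (an h5py.SoftLink for a soft link, with the target in its .path attribute) instead of following it. The helper name load_points and the cache dictionary are hypothetical, and the sketch assumes the soft links store absolute target paths such as /POINTS, as in the hierarchy above.

```python
import h5py
import numpy as np


def load_points(mesh_group, cache, name="POINTS"):
    """Return the array behind `name`, reading each link target only once.

    `cache` is a plain dict mapping an absolute dataset path to the
    NumPy array that was already read for it.
    """
    # Ask for the link object itself rather than the dataset it points to.
    link = mesh_group.get(name, getlink=True)
    if isinstance(link, h5py.SoftLink):
        # Soft link: key the cache on the target path stored in the link.
        # Assumption: the target path is absolute (e.g. "/POINTS").
        key = link.path
    else:
        # Hard link (or anything else): key on the path it was opened with.
        key = mesh_group[name].name
    if key not in cache:
        cache[key] = mesh_group[name][...]  # the only place data is read
    return cache[key]
```

Used in the loop over mesh groups with one shared cache = {}, every group whose POINTS is a soft link to /POINTS gets the same array object back.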

Why am I getting "Unable to read file 'topo60c'. No such file or directory" error in Matlab?

Many of Matlab's Mapping toolbox examples require "topo60c" world map data. Here's an example
load topo60c
axesm hatano
meshm(topo60c,topo60cR)
zlimits = [min(topo60c(:)) max(topo60c(:))];
demcmap(zlimits)
colorbar
However, when I run the above script, Matlab displays a file not found error for "topo60c". Does anyone know why I'm getting this error? I have the Mapping toolbox installed, and it works with other Mapping sample code that doesn't reference that file.
In the acknowledgements section of the mapping toolbox docs there is a note about example data sources:
https://uk.mathworks.com/help/map/dedication-and-acknowledgment.html
Except where noted, the information contained in example and sample data files (found in matlabroot/examples/map/data and matlabroot/toolbox/map/mapdata) is derived from publicly available digital data sets. These data files are provided as a convenience to Mapping Toolbox™ users. MathWorks® makes no claims that any of this data is free of defects or errors, or that the representations of geographic features or names are up to date or authoritative.
You can open these folders from MATLAB (on Windows) using
winopen( fullfile( matlabroot, 'examples/map/data' ) )
winopen( fullfile( matlabroot, 'toolbox/map/mapdata' ) )
Or simply use the fullfile commands above to identify the paths and navigate there yourself.
I can see (MATLAB R2020b) the topo60c file within the first of these folders, which isn't on your path by default because it's within "examples" and not a toolbox directory.
So you could either:
1. Add this folder to your path so that MATLAB can see the file: addpath(fullfile(matlabroot,'examples/map/data'));
2. Reference the full file path to the data when running examples: load(fullfile(matlabroot,'examples/map/data/topo60c.mat'));
I would prefer option 2 to avoid changing the path.
Additionally, there is another note in the Raster Geodata section of the docs which details what that dataset should contain
https://uk.mathworks.com/help/map/raster-geodata.html
When raster geodata consists of surface elevations, the map can also be referred to as a digital elevation model/matrix (DEM), and its display is a topographical map. The DEM is one of the most common forms of digital terrain model (DTM), which can also be represented as contour lines, triangulated elevation points, quadtrees, octree, or otherwise.
The topo60c MAT-file, which contains global terrain data, is an example of a DEM. In this 180-by-360 matrix, each row represents one degree of latitude, and each column represents one degree of longitude. Each element of this matrix is the average elevation, in meters, for the one-degree-by-one-degree region of the Earth to which its row and column correspond.
Given that it's generated from publicly available data anyway (see the first docs quote) and you now know what data it represents (see the second docs quote), you could build some replacement data if really needed.

MATLAB: making a histogram plot from csv files read and put into cells?

Unfortunately I am not too tech proficient and only have a basic MATLAB/programming background...
I have several csv data files in a folder, and would like to make a histogram plot of all of them simultaneously in order to compare them. I am not sure how to go about doing this. Some digging online gave a script:
d=dir('*.csv'); % return the list of csv files
for i=1:length(d)
m{i}=csvread(d(i).name); % put into cell array
end
The problem is I cannot now simply write histogram(m(i)) command, because m(i) is a cell type not a csv file type (I'm not sure I'm using this terminology correctly, but MATLAB definitely isn't accepting the former).
I am not quite sure how to proceed. In fact, I am not sure what exactly is the nature of the elements m(i) and what I can/cannot do with them. The histogram command wants a matrix input, so presumably I would need a 'vector of matrices' and a command which plots each of the vector elements (i.e. matrices) on a separate plot. I would have about 14 altogether, which is quite a lot and would take a long time to load, but I am not sure how to proceed more efficiently.
Generalizing the question:
I will later be writing a script to reduce the noise and smooth out the data in the csv file, and binarise it (the csv files are for noisy images with vague shapes, and I want to distinguish these shapes by setting a cut-off for the pixel intensity/value in the csv matrix, so as to create a binary image showing these shapes). Ideally, I would like to apply this to all of the images in my folder at once so I can sift out which images are best for analysis. So my question is, how can I run a script on all of the csv files in my folder so that I can compare them all at once? I presume whatever technique I use for the histogram plots can apply to this too, but I am not sure.
It would probably be better to write a script which:
- makes a histogram plot and/or runs the binarising script for each csv file in the folder
- and puts all of the images into a new, designated folder, so I can sift through these.
I would greatly appreciate pointers on how to do this. As I mentioned, I am quite new to programming and am getting overwhelmed when looking at suggestions, seeing various different commands used to apparently achieve the same thing- reading several files at once.
The function csvread natively returns a matrix. I am not sure, but it is possible that if some elements inside the csv file are not numbers, MATLAB automatically makes a cell array out of the output. Since I don't know the structure of your csv files, I recommend trying out some similar functions (readtable, xlsread):
M = readtable(d(i).name) % Reads table-like data, most recommended
M = xlsread(d(i).name) % Excel-like structures, but also works on similar data
Try them out and let me know if it worked. If not, please upload a sample file.
The function csvread(filename) always returns a numeric matrix M and never returns a cell array.
If you have textual data inside the .csv file, it gives you an error because the data is not purely numeric. The only reason I can see for using a cell array when reading the files is if the matrices read from each file have different dimensions, for example if the first .csv file contains data organised as 3xA and the second as 2xB; a cell array lets you keep them all in a single structure.
However, it is still possible to use histogram on the contents of a cell array, by extracting the element as an array instead of as a cell.
If M is a cell array, there are two ways of indexing it: M(i) and M{i}. M(i) gives you a 1x1 cell, which cannot be passed to histogram, whereas M{i} returns the element in its original form, which is a numeric matrix.
TL;DR use histogram(M{i}) instead of histogram(M(i)).
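For the more general part of the question (apply the same step to every csv file and collect the results in a separate folder), the overall shape of such a script can be sketched as below. This is an illustration in Python using only the standard library, not MATLAB code; the function names read_matrix, binarise and process_folder, and the rule "value above the cut-off becomes 1", are assumptions made for the example.

```python
import csv
from pathlib import Path


def read_matrix(path):
    """Read a purely numeric csv file into a list of rows of floats."""
    with open(path, newline="") as fh:
        return [[float(v) for v in row] for row in csv.reader(fh) if row]


def binarise(matrix, cutoff):
    """Map every value above `cutoff` to 1 and everything else to 0."""
    return [[1 if v > cutoff else 0 for v in row] for row in matrix]


def process_folder(src, dst, cutoff):
    """Binarise every csv file in `src` and write the result into `dst`.

    Returns the list of files written, in alphabetical order.
    """
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)  # create the output folder once
    written = []
    for path in sorted(Path(src).glob("*.csv")):
        binary = binarise(read_matrix(path), cutoff)
        out_path = dst / path.name  # same file name, different folder
        with open(out_path, "w", newline="") as fh:
            csv.writer(fh).writerows(binary)
        written.append(out_path)
    return written
```

The same loop structure carries over to the histogram case: replace the binarise call with whatever per-file step is needed and save its output under the file's name in the output folder.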

How to use binning to identify which bin an incoming point belongs to?

I have a small query. I have two data sets. For the first one, I binned the data and calculated the mean and standard deviation for each bin. Now I have a second data set of the same parameter, say X. I would like to identify, using MATLAB, which bin of the first data set each value of X belongs to.
Could you give an example of how to identify which bin an incoming data point belongs to?
I used the following binning, which is available in MATLAB:
binEdges = linspace(botEdge, topEdge, numBins+1);
[h,whichBin] = histc(x, binEdges);
Well... you already have your bin edges. Anything between two consecutive edges is in that bin.
If you know that the new data lies inside the range you defined, then for each new data point:
newDataBin = find(newData >= binEdges, 1, 'last'); % bin number the new point falls into (>= matches histc's left-inclusive bins)
h(newDataBin) = h(newDataBin) + 1; % add one!
Also consider using histcounts if your MATLAB version is new enough.
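The lookup rule itself is independent of MATLAB. As an illustrative sketch in Python (the function name which_bin and the sample edges are made up for the example), a binary search over the sorted edges gives the same 1-based bin number that histc reports: bin k holds values with edges(k) <= x < edges(k+1), a value equal to the last edge gets its own final bin, and 0 means out of range.

```python
from bisect import bisect_right


def which_bin(value, bin_edges):
    """Return the 1-based bin index of `value`, or 0 if it is out of range.

    `bin_edges` must be sorted in increasing order. Bin k covers
    bin_edges[k-1] <= value < bin_edges[k]; a value equal to the last
    edge is reported as bin len(bin_edges), mirroring histc.
    """
    if value < bin_edges[0] or value > bin_edges[-1]:
        return 0
    # Number of edges that are <= value, i.e. the index of the last such edge.
    return bisect_right(bin_edges, value)


# Edges equivalent to linspace(0, 10, 5 + 1): 0, 2, 4, 6, 8, 10
bot_edge, top_edge, num_bins = 0.0, 10.0, 5
edges = [bot_edge + i * (top_edge - bot_edge) / num_bins
         for i in range(num_bins + 1)]
```

With these edges, which_bin(3.0, edges) is 2, and a value sitting exactly on an edge such as 2.0 also lands in bin 2, because bins include their left edge.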

Cannot get clustering output Mahout

I am running kmeans in Mahout and as an output I get folders clusters-x, clusters-x-final and clusteredPoints.
If I understood correctly, clusters-x holds the centroid locations in each iteration, clusters-x-final holds the final centroid locations, and clusteredPoints should hold the clustered points, each with a cluster id and a weight that represents the probability of belonging to the cluster (depending on the distance between the point and its centroid). In addition, clusters-x and clusters-x-final contain the cluster centroids, the number of elements, the feature values of each centroid, and the radius of the cluster (the distance between the centroid and its farthest point).
How do I examine these outputs?
I used the cluster dumper successfully on clusters-x and clusters-x-final from the terminal, but when I used it on clusteredPoints, I got an empty file. What seems to be the problem?
And how can I get these values from code? I mean the centroid values and the points belonging to each cluster.
For clusteredPoints I used IntWritable as the key and WeightedPropertyVectorWritable as the value in a while loop, but it skips the loop as if there were no elements in clusteredPoints.
What is even stranger is that the file I get with clusterDumper is empty as well.
What could be the problem?
Any help would be greatly appreciated!
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As for linking points back to the input that created them, I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints), you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors, which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types which you can use, but it also gives you access to a Vector wrapper class called NamedVector, which allows you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
new SequentialAccessSparseVector(vectorDimensions),
vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), writable); // append the VectorWritable wrapper, not the raw vector
You can now use this file as input to one of the clustering algorithms.
After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector))
{
NamedVector nVec = (NamedVector)vector.getVector();
// you now have access to the original name using nVec.getName()
}
Check the parameter named "clusterClassificationThreshold".
clusterClassificationThreshold should be 0.
You can check this http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700#windwardsolutions.com%3E

Text Categorization datasets for MATLAB

I am looking for a reliable dataset for Text categorization tasks in MATLAB format.
I want to run some experiments and don't want to spend too much time preprocessing the text and creating feature vectors. I need something ready-made that I can plug into my algorithm. I found MATLAB files for the Reuters dataset here: link text
Everything is ready in there, but I want to use a subset of it. In this file, "fea" contains the feature vectors for each document. However, it seems that it is not a normal matrix. For example, I want to select the first 1000 documents in "fea". If you just download it and load it into MATLAB you will see what I mean.
So, if possible, I need a solution for the above-mentioned dataset or any alternative datasets.
Thanks in advance.
It is stored as a sparse matrix. Extract the first 1000 documents (rows), and if you have enough memory, you can convert it to a full (dense) matrix:
load Reuters21578.mat
TF = full( fea(1:1000,:) );
Let's check the variables we have:
>> whos
  Name            Size               Bytes  Class     Attributes
  TF           1000x18933        151464000  double
  fea          8293x18933          4749196  double    sparse
  gnd          8293x1                66344  double
  testIdx      2347x1                18776  double
  trainIdx     5946x1                47568  double
So you can see that TF is now about 150 MB.
Other than that, the rest is self-explanatory:
fea: term-frequency matrix, rows are documents, columns are terms
gnd: category of each document, where numel(unique(gnd)) == 65
trainIdx/testIdx: split of instances (documents) for classification purposes, contains indices of rows, used as: tr = fea(trainIdx,:); tt = fea(testIdx,:);