Text Categorization datasets for MATLAB - matlab

I am looking for a reliable dataset for text categorization tasks in MATLAB format.
I want to run some experiments and don't want to spend too much time preprocessing the text and creating feature vectors. I need something ready-made that I can plug into my algorithm. I found MATLAB files for the Reuters dataset here: link text
Everything is ready in there, but I want to use only a subset of it. The variable "fea" contains the feature vectors for each document. However, it does not seem to be a normal matrix. I want, for example, to select the top 1000 documents in "fea". If you download the file and load it into MATLAB you will see what I mean.
So, if possible, I need a solution for the above-mentioned dataset, or any alternative datasets.
Thanks in advance.

It is stored as a sparse matrix. Extract the first 1000 documents (rows), and if you have enough memory, you can convert the result to a full (dense) matrix:
load Reuters21578.mat
TF = full( fea(1:1000,:) );
Let's check the variables we have:
>> whos
  Name             Size               Bytes  Class     Attributes
  TF            1000x18933        151464000  double
  fea           8293x18933          4749196  double    sparse
  gnd           8293x1                66344  double
  testIdx       2347x1                18776  double
  trainIdx      5946x1                47568  double
so you can see TF is now about 150MB.
Other than that, the rest is self-explanatory:
fea: term-frequency matrix, rows are documents, columns are terms
gnd: category of each document, where numel(unique(gnd)) == 65
trainIdx/testIdx: split of instances (documents) for classification purposes, contains indices of rows, used as: tr = fea(trainIdx,:); tt = fea(testIdx,:);
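If memory is tight, the subset can also stay sparse, with the labels subset using the same row indices. The following is only a sketch: fea and gnd here are small synthetic stand-ins (created with sprand and randi) rather than the real Reuters variables, and idx is an illustrative name.

```matlab
% Synthetic stand-ins for the variables in the .mat file (assumption)
fea = sprand(50, 200, 0.05);   % 50 "documents", 200 "terms", sparse
gnd = randi(5, 50, 1);         % one category label per document

idx = 1:10;                    % rows (documents) to keep
feaSub = fea(idx, :);          % row indexing keeps the matrix sparse
gndSub = gnd(idx);             % matching labels for the same documents

TF = full(feaSub);             % convert to dense only if it fits in memory
```

Most built-in operations (sum, multiplication, indexing) work on feaSub directly, so the full() step is only needed when an algorithm requires a dense matrix.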

Related

Applying (with as few loops as possible) a function to given elements/voxels (x,y,z) taken from subfields of multiple structs (nifti's) in MATLAB?

I have a dataset of n nifti (.nii) images. Ideally, I'd like to be able to get the value of the same voxel/element from each image, and apply a function to the n data points. I'd like to do this for each voxel/element across the whole image, so that I can reconvert the result back into .nii format.
I've used the Tools for NIfTI and ANALYZE image toolbox to load my images:
data(1)=load_nii('C:\file1.nii');
data(2)=load_nii('C:\file2.nii');
...
data(n)=load_nii('C:\filen.nii');
From which I obtain a struct object with each sub-field containing one loaded nifti. Each of these has a subfield 'img' corresponding to the image data I want to work on. The problem comes from trying to select a given xyz within each img field of data(1) to data(n). As I discovered, it isn't possible to select in this way:
data(:).img(x,y,z)
or
data(1:n).img(x,y,z)
because MATLAB doesn't support it. The contents of the first brackets have to be scalar for the call to work. The solution I found from googling around seems to be a loop that creates a temporary variable:
for z = 1:nz
    for x = 1:nx
        for y = 1:ny
            for i = 1:n
                points(i) = data(i).img(x,y,z);
            end
            [p1(x,y,z,:),~,p2(x,y,z)] = fit_data(a,points,b);
        end
    end
end
which works, but takes too long (several days) for a single set of images given the size of nx, ny, nz (several hundred each).
I've been looking for a way to speed up the code, which I believe depends on removing those loops by vectorisation, preselecting the img fields (via getfield?) and concatenating them, and applying something like arrayfun/cellfun/structfun, but I'm frankly a bit lost on how to do it. I can only think of ways to pre-select that themselves require loops, which seems to defeat the purpose of the exercise (though a solution with fewer loops, or fewer nested loops at least, might do it), or that run into the same problem that calls like data(:).img(x,y,z) don't work. Googling around again turns up ways to select and concatenate fields within a struct, or a given field across multiple structs. But I can't find anything for my problem: select an element from a non-scalar sub-field in a sub-struct of a struct object (with the minimum of loops). Finally, I need the output to be in the form of a matrix that the toolbox above can turn back into a nifti.
Any and all suggestions, clues, hints and help greatly appreciated!
You can concatenate images as a 4D array and use linear indexes to speed up calculations:
img = cat(4,data.img);
p1 = zeros(nx,ny,nz,n);
p2 = zeros(nx,ny,nz);
sz = ny*nx*nz;
for k = 1 : sz
    points = img(k:sz:end);
    [p1(k:sz:end),~,p2(k)] = fit_data(a,points,b);
end
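If the per-voxel function can be expressed as a matrix operation, the remaining loop can go as well: reshape the 4D array so that each voxel is one row and each image is one column. The sketch below makes several assumptions: the data struct is synthetic, and a straight-line least-squares fit over the image index stands in for fit_data, whose internals are not shown in the question.

```matlab
% Synthetic stand-in for the loaded NIfTI structs (sizes are arbitrary)
nx = 4; ny = 3; nz = 2; n = 5;
data = struct('img', cell(1, n));
for i = 1:n
    data(i).img = i * ones(nx, ny, nz);   % image i is constant, equal to i
end

img = cat(4, data.img);        % nx-by-ny-by-nz-by-n
V = reshape(img, [], n);       % one row per voxel, one column per image

% Hypothetical per-voxel operation: fit value = slope*t + intercept
t = (1:n)';
X = [t, ones(n, 1)];           % design matrix shared by all voxels
coef = (X \ V')';              % (nx*ny*nz)-by-2, solved for all voxels at once

slope = reshape(coef(:, 1), nx, ny, nz);   % back to image shape
```

The final reshape returns an array with the same dimensions as one input image, which is the form needed for writing the result back out as a NIfTI volume.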

Vectorization or For loop in MATLAB

I want to read data from a .txt file to plot a 3D graph in MATLAB. The data looks like this:
T_hor T_ver V_hor V_ver
8,833 -15,43 -11,871 23,604
3,121 -22,78 -9,949 41,712
-8,012 -26,28 -4,317 33,790
-12,697 -20,99 6,948 22,314
-11,960 5,68 2,079 0,469
4,279 -22,17 -10,002 39,791
Each column is imported as a separate column vector, say T_hor, T_ver, V_hor, V_ver. Now I want to read the first 4096 rows of an individual column vector such as T_hor, then the next 4096 rows of the same vector, and so on. This applies to every column. My goal is to compute the FFT of each block of 4096 values and store the results as column vectors. I was previously using commands like these to read the values into different column vectors:
x1 = T_ver(1:4096);
x2 = T_ver(4097:8192);
I want to make the code more logical and avoid unnecessary lines of code. So I tried using a for loop for this, but I think I'm doing it the wrong way.
for X
X = (1:LastValue:4096);
end
I think vectorization would be easier and take less execution time. Can anyone give me a direction or hints on how to implement this?
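One possible direction, offered as a sketch rather than a definitive answer: reshape the vector so that each block of 4096 samples becomes one column, then call fft once, since fft operates on each column of a matrix. T_hor below is random stand-in data, and any trailing samples that do not fill a whole block are dropped, which is an assumption about the desired behaviour.

```matlab
blockLen = 4096;
T_hor = randn(3*blockLen + 100, 1);      % synthetic stand-in for the imported column

nBlocks = floor(numel(T_hor) / blockLen);              % whole blocks only
blocks = reshape(T_hor(1:nBlocks*blockLen), blockLen, nBlocks);

F = fft(blocks);    % column k is the FFT of samples (k-1)*4096+1 : k*4096
```

The same two lines (reshape, then fft) can be applied to T_ver, V_hor and V_ver in turn.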

Average of values from multiple matrices in Matlab

I have 50 matrices contained in one folder, all of dimension 181 x 360. How do I cycle through that folder and take an average of each corresponding data points across all 50 matrices?
If the matrices are contained within Matlab variables stored using save('filename','VariableName') then they can be opened using load('filename.mat').
As such, you can use the result of filesInDirectory = dir; to get a list of all your files, using a search pattern if appropriate, like files = dir('*.mat');
Next you can use your load command, and then whos to see which variables were loaded. You should consider storing these names to make clearing easier after each iteration of your loop.
Once you have your matrix loaded (one at a time), you can take averages as you need, probably summing a value across multiple loop iterations, then dividing by a total counter you've been measuring (using perhaps count = count + size(MatrixVar, dimension);).
If you need all of the matrices loaded at once, then you can modify the above idea, to load using a loop, then average outside of the loop. In this case, you may need to take care - but 50*181*360 isn't too bad I suspect.
A brief introduction to the load command can be found at this link. It talks mainly about opening one matrix, then plotting the values, but there are some comments about dealing with headers, if needed, and different ways in which you can open data, if load is insufficient. It doesn't talk about binary files, though.
Note on binary files, based on comment to OP's question:
If the file can be opened using
FID = fopen('filename.dat');
fread(FID, 'float');
then you can replace the steps referring to load above, and instead use a loop to find filenames using dir, open the matrices using fopen and fread, then average as needed, finally closing the files and clearing the matrices.
In this case, probably your file identifier is the only part you're likely to need to change during the loop (although your total will increase, if that's how you want to average your data)
Reshaping the matrix, or inverting it, might make the code clearer (which is good!), but might not be necessary depending on what you're trying to average - it may be that selecting only a subsection of the matrix is sufficient.
Possible example code?
Assuming that all of the files in the current directory are to be opened, and that no files are elsewhere, you could try something like:
listOfFiles = dir('*.dat');
Total = 0;   % running sum of all values read so far
Counter = 0; % running count of values read so far
for f = 1:numel(listOfFiles)
    FID = fopen(listOfFiles(f).name);
    Data = fread(FID, 'float'); % fread returns a single column vector here
    fclose(FID);
    % Reshape if needed?
    Total = Total + sum(Data(:)); % This might vary, depending on what you want to average (e.g. only a subsection of Data)
    Counter = Counter + numel(Data); % This will be the 181*360 you had in the matrix, in this case
end
Av = Total/Counter;
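For the element-wise case in the question (one average per matrix position, giving a single 181 x 360 result), the running total can be a matrix instead of a scalar. The sketch below assumes the matrices were stored with save in .mat files under a variable named M; that name, the file names and the tiny 2 x 3 size are all made up for illustration.

```matlab
% Create three small synthetic matrices on disk (stand-ins for the 50 files)
folder = tempname;
mkdir(folder);
for k = 1:3
    M = k * ones(2, 3);
    save(fullfile(folder, sprintf('mat%02d.mat', k)), 'M');
end

files = dir(fullfile(folder, '*.mat'));
total = 0;
for f = 1:numel(files)
    S = load(fullfile(folder, files(f).name));  % struct with one field, M
    total = total + S.M;                        % element-wise running sum
end
avgMatrix = total / numel(files);               % same size as each input matrix

rmdir(folder, 's');                             % remove the temporary files
```

Loading into a struct (S = load(...)) avoids having to clear workspace variables between iterations.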

Import multiple tab delimited files into matlab from different subdirectories

Sorry I am new to matlab.
What I have: A folder containing about 80 subfolders, labeled Day01, Day02, Day03, etc. Each subfolder has a file called "sample_ids.txt" It is a n x m matrix in a tab delimited format.
What I need: 1 data structure that is an array of matrices, where each matrix is the data from "sample_ids.txt" and it should be in the alphabetical order of Day01, Day02, Day03, etc.
I have no idea how to get from point A to point B. Any guidance would be greatly appreciated.
You can decompose this problem into two parts: finding the files, and reading them into memory.
Finding the files is pretty easy, and has already been covered on StackOverflow.
For loading them into memory, you want a multidimensional array, which is as simple as creating a regular array and starting to use more index dimensions: A = ones(2); A(:,:,2) = ones(2); will, for example, give you a 3-dimensional array of size 2-by-2-by-2, with ones all over.
What you probably want is something like this:
A = []; % No preallocation. Fix for speed-up.
files = dir('./Day*/sample_ids.txt'); % wildcards in folder names need a recent MATLAB release
for k = 1:numel(files)
    % dir returns only the file name, so prepend the folder it was found in
    temp = load(fullfile(files(k).folder, files(k).name));
    A(:,:,k) = temp;
end
disp(A) % display the contents of A afterwards...
I haven't tested this code extensively, but it should work OK.
A few important points:
All files must contain matrices of the exact same dimensions - MATLAB can't handle arrays that have different dimensions in different layers (at least not with regular arrays - you could use cell arrays, but that quickly becomes more complicated...). Think of it as trying to build a matrix from vectors of different lengths.
If you have a lot of data, and you know how much, you can save a lot of time by pre-allocating A. This is as easy as A = zeros(k,l,m) for m data files with k rows and l columns in each. If you do this, you'll also have to figure out the index of the current file, so you can use that as the third index in the assignment (on the second line in the loop block). I leave this as an internet research exercise :)
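As a rough sketch of the pre-allocated variant: loop over an integer index and use it as the third subscript. The sizes k, l, m and the synthetic temp matrix below are assumptions standing in for the real files, where temp would come from load.

```matlab
k = 2; l = 3; m = 4;              % rows, columns, number of files (assumed known)
A = zeros(k, l, m);               % pre-allocate all layers at once

for idx = 1:m
    temp = idx * ones(k, l);      % stand-in for load(<path to file number idx>)
    A(:, :, idx) = temp;          % the loop counter is the layer index
end
```

If the files turn out to have different dimensions, a cell array (C = cell(1, m); C{idx} = temp;) can hold them instead.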

Conversion from Matlab CSC to CSR format

I am using mex bridge to perform some operations on Sparse matrices from Matlab.
For that I need to convert input matrix into CSR (compressed row storage) format, since Matlab stores the sparse matrices in CSC (compressed column storage).
I was able to get the value array and the column_indices array. However, I am struggling to get the row_pointer array for CSR format. Is there any C library that can help in converting from CSC to CSR?
Further, while writing a CUDA kernel, would it be efficient to use CSR format for sparse operations, or should I just use the following arrays: row indices, column indices and values?
Which one would give me more control over the data, minimizing the number of for-loops in the custom kernel?
Compressed row storage is similar to compressed column storage, just transposed. So the simplest thing is to use MATLAB to transpose the matrix before you pass it to your MEX file. Then, use the functions
Ap = mxGetJc(spA);
Ai = mxGetIr(spA);
Ax = mxGetPr(spA);
to get the internal pointers and treat them as row storage. Ap is row pointer, Ai is column indices of the non-zero entries, Ax are the non-zero values. Note that for symmetric matrices you do not have to do anything at all! CSC and CSR are the same.
Which format to use heavily depends on what you want to do with the matrix later. For example, have a look at matrix formats for Sparse matrix vector multiplication. That is one of the classic papers, research has moved since then so you can look around further.
I ended up converting the CSC format from MATLAB to CSR using the CUSP library as follows.
After getting the matrix A from MATLAB, I extracted its row, col and values vectors and copied them into a thrust::host_vector created for each of them.
After that I created two cusp::array1d of type Indices and Values as follows.
typedef typename cusp::array1d<int,cusp::host_memory>Indices;
typedef typename cusp::array1d<float,cusp::host_memory>Values;
Indices row_indices(rows.begin(),rows.end());
Indices col_indices(cols.begin(),cols.end());
Values Vals(Val.begin(),Val.end());
where rows, cols and Val are thrust::host_vector that I got from Matlab.
After that I created a cusp::coo_matrix_view as given below.
typedef cusp::coo_matrix_view<Indices,Indices,Values>HostView;
HostView Ah(m,n,NNZa,row_indices,col_indices,Vals);
where m,n and NNZa are the parameters that I get from mex functions of sparse matrices.
I copied this view matrix to a cusp::csr_matrix in device memory with proper dimensions set as given below.
cusp::csr_matrix<int,float,cusp::device_memory>CSR(m,n,NNZa);
CSR = Ah;
After that I just copied the three individual content arrays of this CSR matrix back to the host using thrust::raw_pointer_cast where arrays with proper dimension are already mxCalloced as given below.
cudaMemcpy(Acol,thrust::raw_pointer_cast(&CSR.column_indices[0]),sizeof(int)*(NNZa),cudaMemcpyDeviceToHost);
cudaMemcpy(Aptr,thrust::raw_pointer_cast(&CSR.row_offsets[0]),sizeof(int)*(m+1),cudaMemcpyDeviceToHost); // row_offsets has one entry per row plus one, i.e. m+1 for an m-by-n matrix
cudaMemcpy(Aval,thrust::raw_pointer_cast(&CSR.values[0]),sizeof(float)*(NNZa),cudaMemcpyDeviceToHost);
Hope this is useful to anyone who is using CUSP with Matlab
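You can do something like this:
n = size(M,1);
nz_num = nnz(M);
[col,rowi,vals] = find(M'); % entries of M' come out column by column, i.e. ordered by row of M
row = zeros(n+1,1);
ll = 1; row(1) = 1;
% Note: this assumes every row of M has at least one nonzero entry
for l = 2:nz_num % walk over the nonzeros, not the rows
    if rowi(l)~=rowi(l-1)
        ll = ll + 1;
        row(ll) = l; % first nonzero of the next row starts here
    end
end
row(n+1) = nz_num+1;
It works for me, hope it can help somebody else!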
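A loop-free variant that also copes with empty rows is to count the nonzeros per row and take a cumulative sum. This is a sketch on a small made-up matrix; the result uses 1-based row pointers, so subtract 1 from row and col before handing them to C or CUDA code that expects 0-based indices.

```matlab
% Small example matrix with an empty second row (values are arbitrary)
M = sparse([1 1 3 4 4 4], [1 3 2 1 2 4], [10 20 30 40 50 60], 4, 4);

n = size(M, 1);
[col, rowi, vals] = find(M.');         % col/vals ordered row by row of M
counts = accumarray(rowi, 1, [n 1]);   % number of nonzeros in each row
row = cumsum([1; counts]);             % 1-based CSR row pointer, length n+1

disp(row.')   % 1 3 3 4 7
```

Row i of M then occupies positions row(i) to row(i+1)-1 of col and vals; for the empty row the two pointers are equal.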