SciPy: Sparse matrix multiplication memory error

I want to multiply a sparse matrix by its transpose (both are big matrices). Specifically, I have:
C = csc_matrix(...)
Ct = csc_matrix.transpose(C)
L = Ct*C
and shapes:
C.shape
(1791489, 28508141)
Ct.shape
(28508141, 1791489)
And I am getting the following error:
Traceback (most recent call last):
File "C:\...\modularity.py", line 373, in <module>
L = Ct*C
File "C:\...\anaconda3\lib\site-packages\scipy\sparse\base.py", line 480, in __mul__
return self._mul_sparse_matrix(other)
File "C:\...\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 518, in _mul_sparse_matrix
indices = np.empty(nnz, dtype=idx_dtype)
MemoryError: Unable to allocate 1.11 TiB for an array with shape (152087117507,) and data type int64
I cannot figure out why it tries to allocate memory for such a huge array.
Update: Currently I am trying to do the multiplication in chunks like this:
chunksize=1000
numiter = Ct.shape[0]//chunksize
blocks=[]
for i in range(numiter):
    A = Ct[i*chunksize:(i+1)*chunksize].dot(C)
    blocks.append(A)
But I get:
MemoryError: Unable to allocate 217. MiB for an array with shape (57012620,) and data type int32
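On the question of why the allocation is so large: the traceback shows SciPy sizing the result's indices array from the number of entries it expects to store in the product, and 152087117507 entries at 8 bytes each is about 1.11 TiB. A product of two sparse matrices can be far denser than its factors. The following is a minimal sketch with a hypothetical toy matrix (not the matrices from the question). It also computes a simple upper bound on the number of stored entries of C.T @ C, namely the sum of squared row counts of C, which can be checked before attempting the multiplication.

```python
import numpy as np
from scipy.sparse import csr_matrix

n = 1000
# Hypothetical toy matrix: a single fully populated row, so only n stored entries.
C = csr_matrix(np.ones((1, n)))

# Each row of C with r stored entries contributes at most r*r entries to C.T @ C,
# so the sum of squared row counts bounds the size of the product from above.
# C is in CSR format here, so np.diff(C.indptr) gives the per-row counts.
row_counts = np.diff(C.indptr).astype(np.int64)
nnz_bound = int((row_counts ** 2).sum())

L = C.T @ C  # n x n result in which every entry is stored

print("stored entries in C:      ", C.nnz)
print("upper bound for C.T @ C:  ", nnz_bound)
print("stored entries in C.T @ C:", L.nnz)
print("bytes for int64 indices:  ", L.nnz * 8)
```

If the bound is already far beyond the available memory, the full product cannot be held in RAM in this format, whatever the chunk size.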

For future viewers who want to multiply huge sparse matrices: I solved my problem by using PyTables and saving the result of the multiplication in chunks. It still creates a big file, but at least the file is compressed. The code I used goes like this:
import numpy as np
import tables as tb
f = tb.open_file(r'D:\dot.h5', 'w')  # raw string so the backslash is kept literally
l, m, n = Ct.shape[0], Ct.shape[1], C.shape[1]
filters = tb.Filters(complevel=8, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Int32Atom(), shape=(0,), filters=filters)
out_indices = f.create_earray(f.root, 'indices', tb.Int32Atom(), shape=(0,), filters=filters)
# indptr holds running totals of stored entries, which can exceed the int32 range
out_indptr = f.create_earray(f.root, 'indptr', tb.Int64Atom(), shape=(0,), filters=filters)
out_indptr.append(np.array([0], dtype=np.int64))  # this is needed as a first indptr
max_indptr = 0
# buffer size: number of rows of Ct multiplied per iteration
bl = 10000
for i in range(0, l, bl):
    res = Ct[i:min(i+bl, l), :].dot(C)
    out_data.append(res.data)
    indices = res.indices
    indptr = res.indptr
    out_indices.append(indices)
    # shift this block's row pointers by the number of entries already written
    out_indptr.append(max_indptr + indptr[1:].astype(np.int64))
    max_indptr += indices.shape[0]
So if, for example, you want the row with index 2 (the third row) of your final matrix, and assuming a refers to the node that holds the three arrays (for example a = f.root), you can simply do:
start, stop = a.indptr[2], a.indptr[2+1]
L2 = csr_matrix((a.data[start:stop], a.indices[start:stop], np.array([0, stop - start])), shape=(1,n))
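The chunked approach above can be sketched without PyTables. This is a minimal illustration, not the code from the answer: it keeps the pieces in plain NumPy arrays instead of an HDF5 file, the helper name chunked_product and the toy matrix are hypothetical, and it checks the reassembled CSR arrays against the direct product on an input small enough to multiply in one go.

```python
import numpy as np
from scipy.sparse import csr_matrix


def chunked_product(Ct, C, block_rows):
    """Multiply Ct @ C a block of rows at a time and return the CSR arrays."""
    data_parts, index_parts = [], []
    indptr_parts = [np.zeros(1, dtype=np.int64)]  # leading 0 of indptr
    offset = 0  # stored entries written so far
    for start in range(0, Ct.shape[0], block_rows):
        res = (Ct[start:start + block_rows] @ C).tocsr()
        data_parts.append(res.data)
        index_parts.append(res.indices)
        # Shift this block's row pointers by the entries already written.
        indptr_parts.append(offset + res.indptr[1:].astype(np.int64))
        offset += res.nnz
    return (np.concatenate(data_parts),
            np.concatenate(index_parts),
            np.concatenate(indptr_parts))


# Hypothetical toy input with roughly 10% of the entries kept.
rng = np.random.default_rng(0)
dense = rng.random((30, 50))
dense[dense > 0.1] = 0.0
C = csr_matrix(dense)
Ct = C.T.tocsr()

data, indices, indptr = chunked_product(Ct, C, block_rows=7)
n = C.shape[1]
L_chunked = csr_matrix((data, indices, indptr), shape=(Ct.shape[0], n))
L_direct = (Ct @ C).toarray()
print(np.allclose(L_chunked.toarray(), L_direct))

# Read a single row (index 2) straight from the flat arrays.
start, stop = indptr[2], indptr[3]
row2 = csr_matrix((data[start:stop], indices[start:stop],
                   np.array([0, stop - start])), shape=(1, n))
print(np.allclose(row2.toarray()[0], L_direct[2]))
```

Keeping indptr as int64 matters once the total number of stored entries passes the int32 limit, which is the case for the sizes reported in the question.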

Related

How do I load multiple files in MATLAB?

I have the following code to load a single matrix file in MATLAB:
filename='1';
filetype='.txt';
filepath='D:\20170913\';
fidi = fopen(strcat(filepath,filename,filetype));
Datac = textscan(fidi, repmat('%f', 1, 640), 'HeaderLines',1,...
'CollectOutput',1);
f1 = Datac{1};
sum(sum(f1))
How can I load many files, say 1 to 100?
Thanks in advance.
Just put everything in a for loop that runs from 1 up to N_files, the number of files that you have. I used the function num2str() to convert the index i to a string. I also store each matrix sum in an array file_sums and keep all the loaded matrices in a cell array loaded_matrices. If all the loaded matrices have the same, known dimensions, you could use a 3D array instead (e.g. loaded_matrices = zeros(N_rows, N_columns, N_files); and then load the data as loaded_matrices(:,:,i) = Datac{1};).
% N_files - the number of files that you have
N_files = 100;
file_sums = zeros(1,N_files);
loaded_matrices = cell(1,N_files);
for i=1:1:N_files
    filename=num2str(i);
    filetype='.txt';
    filepath='';
    fidi = fopen(strcat(filepath,filename,filetype));
    Datac = textscan(fidi, repmat('%f', 1, 640), 'HeaderLines',1,...
        'CollectOutput',1);
    fclose(fidi); % release the file handle before opening the next file
    loaded_matrices{i} = Datac{1};
    file_sums(i) = sum(sum(loaded_matrices{i}));
end

How to get rid of the "index exceeds matrix dimensions" error (binning data)?

bet{j,3} = react{j};
numBins = {};
edges = linspace(min(bet{j,3}), max(bet{j,3}), numBins(bet{j,3}));
[N, whichBin] = histc(bet{j,3}, edges);
binsize = NaN*zeros(size(bins));
for k = 1:numBins
bin = find(whichBin == k);
binMembers = bet{j,3}(bin);
if (~isempty(bin))
mu(k) = mean(y(bin));
end
end
The error occurs on the line
edges = linspace(min(bet{j,3}), max(bet{j,3}), numBins(bet{j,3}));
and says that the index exceeds matrix dimensions.
Any suggestions as to what the problem could be, and whether this code might work for binning data (e.g., reaction times)?
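Your line numBins = {}; creates an empty cell array, but numBins(bet{j,3}) then tries to index into it. Since it has no elements, the indexing fails with "Index exceeds matrix dimensions". numBins is presumably meant to be a scalar bin count that is passed directly to linspace.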

Reading a .dat file into MATLAB variables

I've used networkx to generate a random geometric graph on 50 nodes and to create a .dat file with some attributes of this network.
I need to access these as MATLAB variables. I read the file in as a data string using:
fid = fopen('mydata.dat','r')
data = textscan(fid, '%s')
fclose(fid)
The structure of the data file is as follows
conn = val
Adj = val ..... val
.............
val ......val
pos =
[0.7910629988376467, 0.5523474928588686]
...
[0.6799716933198028, 0.6981655240935597]
i.e. conn is a number, Adj is (supposed to be) a 50 by 50 matrix and pos is a 50 by 2 matrix.
I can read conn, and Adj as MATLAB variables fine, but I'm having trouble reading pos. The first instance starts at data{1}{2508}, and is
data{1}{2508}
>>> [0.7832623541518583,
How do I shoehorn this into a 50 by 2 (or 2 by 50) matrix?
To read the Adj I use
P = 50 %number of nodes
index = 5
for i=1:P
for j = 1:P
Adj(i,j) = str2double(data{1}(index + P*(i-1) +j))
end
end
I thought something similar would work for pos, with j = 1:2 and index = 2508, but I'm getting NaNs because the fields of my .dat file aren't plain values: they have the form "[val," or "val]".
You can first delete all characters you don't want to have.
data = regexprep(data{1},'[\[\],]','');
After that, your loop should succeed. However, you can speed up your code by using array functions.
Find the occurrence of pos:
ind = find(strcmp(data,'pos')); % Should be 2506 in your case
After that, since you know that your array is 50x2, use:
pos = str2double(reshape(data(ind+2:end),2,50)')
Note that the +2 skips the two tokens pos and =.

MATLAB: vectors of different length

I want to create a MATLAB function that imports data from files in another directory and fits them to a given model. The data need to be filtered first (there's "trash" data in different places in the files, e.g. measurements of nothing before the analyzed motion starts).
As a result, the vectors that contain the data used for the fit end up having different lengths, so I can't return them in a matrix (e.g. x in my function below). How can I solve this?
I have a lot of data files, so I don't want to use a "manual" method. My function is below. Any and all suggestions are welcome.
datafit.m
function [p, x, y_c, y_func] = datafit(pattern, xcol, ycol, xfilter, calib, p_calib, func, p_0, nhl)
datafiles = dir(pattern);
path = fileparts(pattern);
p = NaN(length(datafiles));
y_func = [];
for i = 1:length(datafiles)
exist(strcat(path, '/', datafiles(i).name));
filename = datafiles(i).name;
data = importdata(strcat(path, '/', datafiles(i).name), '\t', nhl);
filedata = data.data/1e3;
xdata = filedata(:,xcol);
ydata = filedata(:,ycol);
filter = filedata(:,xcol) > xfilter(i);
x(i,:) = xdata(filter);
y(i,:) = ydata(filter);
y_c(i,:) = calib(y(i,:), p_calib);
error = @(par) sum(power(y_c(i,:) - func(x(i,:), par),2));
p(i,:) = fminsearch(error, p_0);
y_func = [y_func; func(x(i,:), p(i,:))];
end
end
sample data: http://hastebin.com/mokocixeda.md
There are two strategies I can think of:
I would return the data in a vector of cells instead, where the individual cells store vectors of different lengths. You can access data the same way as arrays, but use curly braces: Say c{1}=[1 2 3], c{2}=[1 2 10 8 5] c{3} = [ ].
You can also filter the trash data upon reading a line, if that makes your vectors have the same length.
If memory is not a major issue, try padding the vectors with a distinct value such as NaN or Inf: anything that cannot occur in your measurements given their physical context. You might need to identify the longest data set before you allocate memory for your matrices (*). This way, you can use equally sized matrices and easily ignore the "empty data" later on.
(*) Idea ... allocate memory based on the size of the largest file first. Fill it up with e.g. NaN's
matrix = zeros(length(datafiles), longest_file_line_number) .* NaN;
Then run your function. Determine the length of the longest consecutive set of data.
new_max = length(xdata(filter));
if new_max > old_max
    old_max = new_max;
end
% write into the first new_max columns; the rest of the row stays NaN
matrix(i, 1:new_max) = xdata(filter);
Crop your matrix accordingly, before the function returns it ...
matrix = matrix(:, 1:old_max);

Possibly incorrect Matlab error: "Subscripted assignment dimension mismatch"

Matlab is giving me the error, "Subscripted assignment dimension mismatch" however I don't think there should be an issue. The code is below but basically I have a temp matrix that mimics the dimensions of another matrix, testData (actually a subset of it). I can assign the output of imread to the temp matrix but not to a subset of testData that has the same dimensions. I can even use the size function to prove they are the same dimensions yet one works and one doesn't. So I set temp = imread and then testData = temp and it works. But why should I have to do that?
fileNames = dir('Testing\*.pgm');
numFiles = size(fileNames, 1);
testData = zeros(32257, numFiles);
temp = zeros(32256, 1);
for i = 1 : numFiles,
fileName = fileNames(i).name;
% Extracts some info from the file's name and stores it in the first row
testData(1, i) = str2double(fileName(6:7));
% Here temp has the same dimensions as testData(2:end, i)
% yet testData(2:end, i) = imread(fileName) doesn't work
% however it works if I use temp as a "middleman" variable
temp(:) = imread(fileName);
testData(2:end, i) = temp(:);
end
imread returns an MxN matrix for a grayscale image (such as a .pgm file) and an MxNx3 array for a color image. You can't assign a 2D or 3D array to a column slice without reshaping it, even if it contains the same number of elements. That's probably why you get the error when you try to assign the output of imread directly to testData. However, when you use an intermediate variable and collapse it into a column vector, the assignment works because now you're assigning a 1D vector to another 1D vector of equal size.
If you don't want to use an additional step, try this
testData(2:end,i)=reshape(imread(fileName),32256,1);