MATLAB: from text file to sparse matrix

I have a huge text file in the following format:
1 2
1 3
1 10
1 11
1 20
1 376
1 665255
2 4
2 126
2 134
2 242
2 247
The first column is the x coordinate and the second column is the y coordinate.
Each line marks an entry that should be set to 1, as if I had constructed the matrix like this:
M = zeros(N, N);
M(1, 2) = 1;
M(1, 3) = 1;
.
.
M(2, 247) = 1;
This text file is huge and can't be brought into main memory at once, so I must read it line by line and store the result in a sparse matrix.
So I need the following function:
function mat = generate( path )
fid = fopen(path);
tline = fgetl(fid);
% initialize an empty sparse matrix. (I know I assigned Mat(1, 1) = 1)
mat = sparse(1);
while ischar(tline)
tline = fgetl(fid);
if ischar(tline)
C = strsplit(tline);
end
mat(C{1}, C{2}) = 1;
end
fclose(fid);
end
But unfortunately besides the first row it just puts trash in my sparse mat.
Demo:
1 7
1 9
2 4
2 9
If I print the sparse mat I get:
(1,1) 1
(50,52) 1
(49,57) 1
(50,57) 1
Any suggestions?

Fixing what you have...
Your problem is that C is a cell array of character vectors, not numbers, so you need to convert the strings you read from the file into numeric values. Instead of strsplit you can use functions like str2num and str2double. Since tline is a space-delimited character array of integers in this case, str2num is the easiest way to compute C:
C = str2num(tline);
Then you just index C like an array instead of a cell array:
mat(C(1), C(2)) = 1;
Extra tidbit: If you were wondering how your demo code still worked even though C contained characters, it's because MATLAB has a tendency to automatically convert variables to the correct type for certain operations. In this case, the characters were converted to their double ASCII code equivalents: '1' became 49, '2' became 50, etc. Then it used these as indices into mat.
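Putting the fix together, here is a minimal sketch of a corrected read loop, assuming every line holds exactly two positive integers. It also processes the first line, which the loop in the question never uses because it calls fgetl a second time before doing anything with the result of the first call. The demo file and the variable names rows and cols are made up for illustration. Collecting the indices and calling sparse once at the end is a design choice here (growing a sparse matrix one element at a time tends to be slow), not something the question requires.

```matlab
% Write a small demo file (hypothetical data, same layout as the question).
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fprintf(fid, '%d %d\n', [1 7; 1 9; 2 4; 2 9].');
fclose(fid);

rows = zeros(0, 1);
cols = zeros(0, 1);
fid = fopen(demoFile, 'r');
tline = fgetl(fid);                 % read the first line
while ischar(tline)                 % fgetl returns -1 at end of file
    idx = sscanf(tline, '%f');      % numeric column vector, not characters
    if numel(idx) == 2              % skip blank or malformed lines
        rows(end+1, 1) = idx(1);
        cols(end+1, 1) = idx(2);
    end
    tline = fgetl(fid);             % read the next line at the END of the loop
end
fclose(fid);
delete(demoFile);

% Note: sparse adds up values that share the same (row, col) pair.
mat = sparse(rows, cols, 1);
```

With the demo data this gives a 2-by-9 sparse matrix with ones at (1,7), (1,9), (2,4) and (2,9).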
A simpler alternative...
You don't even have to bother with all that mess above, since you can replace your entire function with a much simpler approach using dlmread and sparse like so:
data = dlmread(filePath);
mat = sparse(data(:, 1), data(:, 2), 1);
clear data; % Save yourself some memory if you don't need it any more
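dlmread parses the whole file in one call (and since R2019a MathWorks recommends readmatrix in its place). If that is too much for memory, one option is to parse the file in blocks with textscan. The following is only a sketch under assumptions: the demo file, chunkSize and N are made up for illustration, and the index vectors still have to fit in memory, which they must anyway if the resulting sparse matrix does. Passing N explicitly makes the result N-by-N, as in the question, instead of max(row)-by-max(col).

```matlab
% Write a small demo file (hypothetical data).
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fprintf(fid, '%d %d\n', [1 7; 1 9; 2 4; 2 9].');
fclose(fid);

chunkSize = 3;                      % lines parsed per textscan call (tune as needed)
rows = zeros(0, 1);
cols = zeros(0, 1);
fid = fopen(demoFile, 'r');
while ~feof(fid)
    block = textscan(fid, '%f %f', chunkSize);
    if isempty(block{1})            % nothing parsed: stop instead of looping forever
        break;
    end
    rows = [rows; block{1}];
    cols = [cols; block{2}];
end
fclose(fid);
delete(demoFile);

N = max([rows; cols]);              % assumed here: smallest square size that fits
mat = sparse(rows, cols, 1, N, N);
```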

Related

MATLAB. How to remove rows, if any of the values in the row is found in another row?

I have a matrix as shown in the image. In this matrix, if any of the values in one row is found in another row, we remove the shorter row. For example, rows 2 to 5 all contain 3, so I want to keep only row 5 (the row with the most non-zero values) and remove all the others. Please suggest a solution.
Thanks
I believe the code below should work. The idea is to sort the matrix by the number of non-zero elements in each row, then loop over the rows and remove every row that shares a value with a longer row. It is probably not the most efficient code, but it should work in principle; see the comments for more explanation.
% generating the data
M = zeros(6, 10);
M(2,1:3) = [3 8 10];
M(3,1:4) = [3 8 10 9];
M(4,1:5) = [3 8 10 9 7];
M(5,1:6) = [3 8 10 9 7 4];
M(6,1) = [5];
% sorting according to the number of non-zero elements
nr_of_nonzero = sum(M~=0, 2);
[~, sort_indices] = sort(nr_of_nonzero);
M_sorted = M(sort_indices,:);
M_sorted(M_sorted==0)=NaN; % should not compare 0s (?)
% get rid of the matches
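for i=1:size(M_sorted, 1)-1
    for j=(i+1):size(M_sorted, 1)
        C = intersect(M_sorted(i,:),M_sorted(j,:)); % NaNs never match each other
        if numel(C)>0
            M_sorted(i,:) = NaN; % row i shares a value with a longer row j
            break; % row i is gone, no need to compare it with further rows
        end
    end
end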
% reorder
M(sort_indices,:) = M_sorted;
% remove all NaN rows
M(all(isnan(M),2),:) = [];
% back to 0s
M(isnan(M)) = 0;
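The same idea can also be written without NaN placeholders, using ismember on the non-zero values of each row. This is a sketch rather than a drop-in replacement: the names counts, keep and result are made up, all-zero rows are dropped, and ties between rows of equal length are resolved in favour of the earlier row, which is an assumption the question does not settle.

```matlab
% Same example data as above.
M = zeros(6, 10);
M(2,1:3) = [3 8 10];
M(3,1:4) = [3 8 10 9];
M(4,1:5) = [3 8 10 9 7];
M(5,1:6) = [3 8 10 9 7 4];
M(6,1)   = 5;

counts = sum(M ~= 0, 2);            % non-zero entries per row
keep   = counts > 0;                % all-zero rows are dropped
for i = find(keep).'
    vi = M(i, M(i,:) ~= 0);         % non-zero values of row i
    for j = find(counts > 0).'
        if j == i
            continue;
        end
        % Row j "beats" row i if it is longer, or equally long but earlier.
        beats = counts(j) > counts(i) || (counts(j) == counts(i) && j < i);
        if beats && any(ismember(vi, M(j,:)))
            keep(i) = false;
            break;
        end
    end
end
result = M(keep, :);
```

For the example data only the original rows 5 and 6 remain.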
I'm not doing all the code here, but here's the steps that I would take to solve it. You will likely have to try different ways of doing it to obtain the intended result (i.e. vector operations, while loop, for loop, etc.).
Problem
Rows are repetitive and need to be reduced in a more compact form.
Solution
Look up mat2str.
Convert your vectors (rows) to strings. This can be done with temporary values like tmpstr1 = mat2str(yourMatrix(rowToBeCompared, :));
Parse the first string from beginning to end, while parsing the second string in the same way to make comparisons.
use strcmp to see if the string characters (or strings themselves) are the same: http://www.mathworks.com/help/matlab/ref/strcmp.html
Delete a row if you find it appropriate with yourMatrix(rowToDelete, :) = [];
Try that and see if it works.
Note - Expansion of step 3:
if we have variable a = '[ab+11]';, we can select individual characters from the string like:
a(4)
ans = '+'
a(5)
ans = '1'
a(1)
ans = '['
Therefore, you can parse the string with a loop:
for n = 1 : length(a)
if a(n) == '1' || a(n) == '0'
str(n) = a(n);
end
end
Like Sardar Usama said, it's helpful to provide the code so that we can copy and paste into our own MATLAB workspaces.

MATLAB Best way to extract data from text files and convert rows to vectors?

I'm having a lot of issues trying to extract the data from the attached text file ('scratch.txt'). My goal is to convert every row in the text file to a 1xn vector where n is the number of variables in the text file row. I also need to make sure that every value extracted into these vectors is an 8 byte floating point.
This is what I have so far but I don't know how to convert what I currently have as the output into a matrix:
fid = fopen('scratch.txt');
tline = fgetl(fid);
while ischar(tline)
disp(tline)
tline = fgetl(fid);
end
Currently this is what I get as the output:
4 3
1 10
2 30
3 20
4 0
1 4 1
2 1 3
3 3 2
1.e7 1.339 .5
4
1 5 3 4
1
7 5.0
Use str2num to convert tline into a numeric row vector.
Since you have varying number of elements in each row, you cannot convert your data into a matrix: matrix (by definition) has the same number of elements in each row.
What you can do, is store the row in a cell array.
res = {};
fid = fopen('scratch.txt');
tline = fgetl(fid);
while ischar(tline)
res{end+1} = str2num(tline);
tline = fgetl(fid);
end
fclose(fid);
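str2num evaluates the text it is given (it is built on eval), so for plain numeric lines sscanf is a possible alternative. The sketch below assumes whitespace-separated numbers on each line and uses a made-up demo file instead of scratch.txt. sscanf returns double values, which are 8-byte floating point numbers, and the transpose turns each result into a 1-by-n row vector.

```matlab
% Write a small demo file with rows of different lengths (hypothetical data).
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fprintf(fid, '%s\n', '4 3', '1.e7 1.339 .5', '7 5.0');
fclose(fid);

res = {};
fid = fopen(demoFile, 'r');
tline = fgetl(fid);
while ischar(tline)
    res{end+1} = sscanf(tline, '%f').';   % 1-by-n double row vector
    tline = fgetl(fid);
end
fclose(fid);
delete(demoFile);
```

Each cell of res then holds one row, for example res{1} is [4 3].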

MATLAB reading to the end of a binary file

I think the solution will be quite simple for somebody with some MATLAB knowhow however I do not know how to do it.
I have a binary file that I am reading with fread and I am reading the first 4 bytes of this file followed by the next 2 bytes.
I basically want this process of reading 4 bytes followed by 2 bytes repeated till the end of the file is reached.
So the number of bytes read is 4,2,4,2,4,2......
I have the following to read the first pair of data and I want this to repeat.
fileID = fopen('MyBinaryFile');
4bytes = fread(fileID, 4);
fseek(fileID, 4, 0);
2bytes = fread(fileID, 2);
Thanks in advance for any help and suggestions
I take it this is a variant of your former question MATLAB reading a mixed data type binary file.
Your goal is to read a binary file containing mixed data type. In your case it contains 2 columns:
1x single value (4 bytes) and 1x int16 value (2 bytes).
There are several ways to read this type of file. They differ in speed: some minimize disk access but require more temporary memory, while others use only the memory needed but require more disk access (and are therefore slower).
Ultimately, the 3 ways I'm going to show you produce exactly the same result.
The direct answer to this question is the version #3 below, but I encourage you to have a look at the 2 other options described here, they are both really worth understanding.
For the purpose of the example, I had to create a binary file as you described. This is done this way:
%% // write example file
A = single(linspace(-3,1,11)) ; %// a few "float" (=single) data
B = int16(-5:5) ; %// a few "int16" data
fileID = fopen('testmixeddata.bin','w');
for il=1:11
fwrite(fileID,A(il),'single');
fwrite(fileID,B(il),'int16');
end
fclose(fileID);
This creates a two-column binary file, the columns being:
11 values of type float going from -3 to 1.
11 values of type int16 going from -5 to +5.
For future reference:
>> disp(A)
-3.0000 -2.6000 -2.2000 -1.8000 -1.4000 -1.0000 -0.6000 -0.2000 0.2000 0.6000 1.0000
>> disp(B)
-5 -4 -3 -2 -1 0 1 2 3 4 5
In each of the solutions below, the first column is read into a variable called varSingle and the second column into a variable called varInt16.
1) Read all data in one go - convert to proper type after
%% // SOLUTION 1 (fastest) : Read all data in one go - convert to proper type after
fileID = fopen('testmixeddata.bin');
R = fread(fileID,'uint8=>uint8') ; %// read all values, most basic data type (unsigned 8 bit integer)
fclose(fileID);
colSize = [4 2] ; %// number of byte for each column [4 byte single, 2 byte int16]
R = reshape( R , sum(colSize) , [] ) ; %// reshape data into a matrix with one record per matrix column (4+2 = 6 bytes per record)
temp = R(1:4,:) ; %// extract data for first column into temporary variable (OPTIONAL)
varSingle = typecast( temp(:) , 'single' ) ; %// convert into "single/float"
temp = R(5:end,:) ; %// extract data for second column
varInt16 = typecast( temp(:) , 'int16' ) ; %// convert into "int16"
This is my favourite method, especially for speed: it minimizes the read/seek operations on disk, and most of the post-processing is done in memory (much faster than disk operations).
Note that the temporary variable is only there for clarity; you can avoid it altogether if you index into the raw data directly.
The key thing to understand is the use of the typecast function, which reinterprets the underlying bytes as another type without changing them. And the good news is it got even faster since 2014b.
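To see what typecast does in isolation, here is a small round-trip sketch. The variable names are made up, and the int16 line assumes a little-endian machine, which is an assumption about the platform rather than something stated in the answer.

```matlab
x = single(-2.6);
rawBytes = typecast(x, 'uint8');        % the 4 bytes that store x
y = typecast(rawBytes, 'single');       % reinterpret the same bytes as single again

% Two bytes interpreted as one int16 (little-endian byte order assumed).
n = typecast(uint8([251 255]), 'int16');
```

No value conversion takes place: y is bit-for-bit identical to x, which is why this approach is fast.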
2) Read column by column (using "skipvalue") - 2 pass approach
%% // SOLUTION 2 : Read column by column (using "skipvalue") - 2 pass approach
col1size = 4 ; %// size of data in column 1 (in [byte])
col2size = 2 ; %// size of data in column 2 (in [byte])
fileID = fopen('testmixeddata.bin');
varSingle = fread(fileID,'*single',col2size) ; %// read all "float" values, skipping all "int16"
fseek(fileID,col1size,'bof') ; %// rewind to beginning of column 2 at the top of the file
varInt16 = fread(fileID,'*int16',col1size) ; %// read all "int16" values, skipping all "float"
fclose(fileID);
That works too, but it is going to be slower than method 1 above because you have to scan the file twice. It may be a good option if the file is very large and method 1 fails with an out-of-memory error.
3) Read element by element
%% // SOLUTION 3 : Read element by element (slow - not recommended)
fileID = fopen('testmixeddata.bin');
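varSingle=single([]);varInt16=int16([]); %// start with empty arrays of the right class so the values are not stored as double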
while ~feof(fileID)
try
varSingle(end+1) = fread(fileID, 1, '*single' ) ;
varInt16(end+1) = fread(fileID, 1, '*int16' ) ;
catch
disp('reached End Of File')
end
end
fclose(fileID);
That does work too, and if you were writing C code it would be more than ok. But in Matlab this is not the recommended way to go (your choice ultimately)
As promised, the 3 methods above will give you exactly what we wrote in the file at the beginning:
>> disp(varSingle)
-3.0000 -2.6000 -2.2000 -1.8000 -1.4000 -1.0000 -0.6000 -0.2000 0.2000 0.6000 1.0000
>> disp(varInt16)
-5 -4 -3 -2 -1 0 1 2 3 4 5
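fileID = fopen('MyBinaryFile');
kk = 1;
while ~feof(fileID)
    b4 = fread(fileID, 4); % next 4 bytes
    b2 = fread(fileID, 2); % following 2 bytes
    if numel(b4) < 4 || numel(b2) < 2
        break; % incomplete record: the end of the file was reached
    end
    bytes4(:, kk) = b4;
    bytes2(:, kk) = b2;
    kk = kk + 1;
end
fclose(fileID);
The while loop condition is ~feof, where feof tests for end-of-file, so the loop runs as long as the end of the file has not been reached. Because feof only becomes true after a read has hit the end, the loop also stops when a read returns fewer bytes than requested. The fseek call from the question is dropped: fread already advances the file position, so seeking another 4 bytes would skip data.
I added kk so that every record is stored (one record per column) rather than overwritten on each loop iteration.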
If you want to get the data without loops, there are MATLABish ways to do that:
%'Sizes'
T = 4; %'Time record size'
D = 2; %'Date record size'
R = T+D; %'Record size'
%'Open file'
f = fopen('MyBinaryFile', 'rb');
if f < 0
error('Could not open file.');
end;
%'Read the entire file at once, and close file'
buf = fread(f, Inf, '*uint8');
fclose(f);
%'Ignore the last unpadded bytes, and reshape by the size of 1 record'
buf = reshape(buf(1:R*fix(numel(buf)/R)), R, []);
%'Pinpoint the data'
time_bytes = buf( 1: T, :);
date_bytes = buf(T+1:T+D, :);
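The code above stops at raw bytes. If the 4-byte field is a single and the 2-byte field an int16 (an assumption taken from the other answer, not from the question), the bytes can be turned into values with typecast. The sketch below writes its own small demo file; the file and the names timeVals and dateVals are made up for illustration.

```matlab
% Write 3 records of (single, int16) plus one stray trailing byte.
demoFile = [tempname '.bin'];
f = fopen(demoFile, 'w');
for k = 1:3
    fwrite(f, single(k) / 2, 'single');
    fwrite(f, int16(-k), 'int16');
end
fwrite(f, uint8(7), 'uint8');
fclose(f);

T = 4; D = 2; R = T + D;                 % field sizes and record size in bytes
f = fopen(demoFile, 'r');
buf = fread(f, Inf, '*uint8');
fclose(f);
delete(demoFile);

% Drop the incomplete tail and put one record in each column.
buf = reshape(buf(1:R*fix(numel(buf)/R)), R, []);

% Reinterpret the bytes of each field; typecast needs a vector, hence the reshape.
timeVals = typecast(reshape(buf(1:T, :), [], 1), 'single');
dateVals = typecast(reshape(buf(T+1:R, :), [], 1), 'int16');
```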

Exporting blank values into a .txt file - MATLAB

I'm currently trying to export multiple matrices of unequal lengths into a delimited .txt file thus I have been padding the shorter matrices with 0's such that dlmwrite can use horzcat without error:
dlmwrite(filename{1},[a,b],'delimiter','\t')
However ideally I do not want the zeroes to appear in the .txt file itself - but rather the entries are left blank.
Currently the .txt file looks like this:
55875 3.1043e+05
56807 3.3361e+05
57760 3.8235e+05
58823 4.2869e+05
59913 4.3349e+05
60887 0
61825 0
62785 0
63942 0
65159 0
66304 0
67509 0
68683 0
69736 0
70782 0
But I want it to look like this:
55875 3.1043e+05
56807 3.3361e+05
57760 3.8235e+05
58823 4.2869e+05
59913 4.3349e+05
60887
61825
62785
63942
65159
66304
67509
68683
69736
70782
Is there any way I can do this? Is there an alternative to dlmwrite that does not require the matrices to have equal lengths?
If a is always longer than b you could split vector a into two vectors of same length as vector b and the rest:
a = [1 2 3 4 5 6 7 8]';
b = [9 8 7 ]';
len = numel(b);
dlmwrite( 'foobar.txt', [a(1:len), b ], 'delimiter', '\t' );
dlmwrite( 'foobar.txt', a(len+1:end), 'delimiter', '\t', '-append');
You can read in the numeric data and convert to string and then add proper whitespaces to have the final output as string based cell array, which you can easily write into the output text file.
Stage 1: Get the cell of strings corresponding to the numeric data from column vector inputs a, b, c and so on -
%// Concatenate all arrays into a cell array with numeric data
A = [{a} {b} {c}] %// Edit this to add more columns
%// Create a "regular" 2D shaped cell array to store the cells from A
lens = cellfun('length',A)
max_lens = max(lens)
A_reg = cell(max_lens,numel(lens))
A_reg(:) = {''}
A_reg(bsxfun(@le,[1:max_lens]',lens)) = cellstr(num2str(vertcat(A{:}))) %//'
%// Create a char array that has string data from input arrays as strings
wsp = repmat({' '},max_lens,1) %// Create whitespace cell array
out_char = [];
for iter = 1:numel(A)
out_char = [out_char char(A_reg(:,iter)) char(wsp)]
end
out_cell = cellstr(out_char)
Stage 2: Now, that you have out_cell as the cell array that has the strings to be written to the text file, you have two options next for the writing operation itself.
Option 1 -
dlmwrite('results.txt',out_cell(:),'delimiter','')
Option 2 -
outfile = 'results.txt';
fid = fopen(outfile,'w');
for row = 1:numel(out_cell)
fprintf(fid,'%s\n',out_cell{row});
end
fclose(fid);
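If there are only a few columns, a plain fprintf loop that leaves a field empty once a vector runs out may be simpler than building strings first. This is a sketch with made-up data and file name; the '%g' format and the tab delimiter are assumptions chosen to resemble the output in the question.

```matlab
a = [55875; 56807; 57760; 58823];       % longer column
b = [3.1043e5; 3.3361e5];               % shorter column, no zero padding

outFile = [tempname '.txt'];
fid = fopen(outFile, 'w');
for k = 1:max(numel(a), numel(b))
    if k <= numel(a)
        fprintf(fid, '%g', a(k));
    end
    fprintf(fid, '\t');                 % delimiter is always written
    if k <= numel(b)
        fprintf(fid, '%g', b(k));       % left blank when b has run out
    end
    fprintf(fid, '\n');
end
fclose(fid);

written = fileread(outFile);            % read back to inspect the result
delete(outFile);
```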

MATLAB: How to load an array without delimiter from file

I have been using the load function until now to load my space separated files into arrays in matlab. However this wastes lots of space for me since my values are either 0 or 1. Thus instead of writing files like
0 1 1 0
1 1 1 0
I removed the spaces to create half as big files like:
0110
1110
However now load doesn't work correctly any more (creates a matrix I think with only the first number, so 2x1 instead of 2x4).
I looked around with importdata, reading the file line by line and lots of other stuff but I couldn't find a clear solution.
So essentially I want to read a matrix from a file which doesn't have a delimiter. Every number is an element of the array
Does anyone know of a clean way to do this?
Thanks
Here is one way:
data.txt
0110
1110
MATLAB
%# read file lines as cell array of strings
fid = fopen('data.txt','rt');
C = textscan(fid, '%s', 'Delimiter','');
fclose(fid);
%# extract digits
C = cell2mat(cellfun(@(s)s-'0', C{1}, 'Uniform',false));
result:
C =
0 1 1 0
1 1 1 0
If you are really concerned about memory, you could cast the result to logical, C = logical(C), since 0 and 1 are the only possible values.
Adapted from the example here:
function ret = readbin()
fid = fopen('data.txt');
ret = [];
tline = fgetl(fid);
while ischar(tline)
if isempty(ret)
ret = tline;
else
ret = [ret; tline];
end
tline = fgetl(fid);
end
% Turn char '0' into numerical 0
ret = ret - 48;
fclose(fid);
end
Subtracting 48 (ASCII code for '0') you get a numeric matrix with 1's and 0's in the appropriate places. This is the output I get at the end of the function:
K>> ret
ret =
0 1 1 0
1 1 1 0
K>> class(ret)
ans =
double
Inspired from this question: Convert decimal number to vector, here is a proposition, but I don't think it will work for large "numbers":
arrayfun(@str2double, str);
An alternate form of Amro's solution would be either
fid = fopen('data.txt','r');
C = textscan(fid, '%1d%1d%1d%1d');
fclose(fid);
M = cell2mat(C)
Or
fid = fopen('data.txt','r');
M_temp = textscan(fid, '%1d');
fclose(fid);
M = reshape(M_temp{1}, 4, 2).' % digits arrive row by row, so fill a 4-by-2 matrix column-wise and transpose
These are slightly different from Amro's solution as the first reads in a %1d format number (1 character integer) and returns a cell array C that can be converted using a much simpler cell2mat command. The second bit of code reads in only a vector of those values then reshapes the result---which works okay if you already know the size of the data but may need additional work if not.
If your actual files are very large you may find that one of these ways is faster, although it is hard to tell without actually running them side-by-side.
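When the size is not known in advance, one way to avoid hard-coding it is to read the whole file as text and let char build the matrix from the lines. This is a sketch under assumptions: the demo file is made up, every line must have the same number of digits, and the file is assumed to use plain '\n' line endings.

```matlab
% Write a small demo file (same content as data.txt above).
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fprintf(fid, '0110\n1110\n');
fclose(fid);

txt = fileread(demoFile);                 % whole file as one char vector
delete(demoFile);
lines = strsplit(strtrim(txt), '\n');     % one cell per line
M = char(lines) - '0';                    % char matrix minus '0' gives numeric 0/1
L = logical(M);                           % optional: 1 byte per element
```

The number of rows and columns follows from the file itself, so no reshape with fixed dimensions is needed.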