MATLAB clustering and data formats

Following on from a previous question, FCM Clustering numeric data and csv/excel file, I'm now trying to figure out how to take the output and create a workable .dat file for use with clustering in MATLAB.
%# read the list of features
fid = fopen('kddcup.names','rt');
C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
fclose(fid);
%# determine type of features
C{2} = regexprep(C{2}, '.$',''); %# remove "." at the end
attribNom = [ismember(C{2},'symbolic');true]; %# nominal features
%# build format string used to read/parse the actual data
frmt = cell(1,numel(C{1}));
frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
frmt( ismember(C{2},'symbolic') ) = {'%s'}; %# nominal features: read as string
frmt = [frmt{:}];
frmt = [frmt '%s']; %# add the class attribute
%# read dataset
fid = fopen('kddcup.data','rt');
C = textscan(fid, frmt, 'Delimiter',',');
fclose(fid);
%# convert nominal attributes to numeric
ind = find(attribNom);
G = cell(numel(ind),1);
for i=1:numel(ind)
    [C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
end
%# all numeric dataset
M = cell2mat(C);
I have several types of data. I tried the method below to create a .dat file, but came up with this error:
>> a = load('matlab.mat');
>> save 'matlab.dat' a -ascii
Warning: Attempt to write an unsupported data type
to an ASCII file.
Variable 'a' not written to file.
>> a = load('data.mat');
>> save 'matlab.dat' a -ascii
Warning: Attempt to write an unsupported data type
to an ASCII file.
Variable 'a' not written to file.
>> save 'matlab.dat' a
>> findcluster('matlab.dat')
??? Error using ==> load
Number of columns on line 1 of ASCII file
C:\Users\Garrith\Documents\MATLAB\matlab.dat
must be the same as previous lines.
Error in ==> findcluster>localloadfile at 471
load(filename);
Error in ==> findcluster at 160
localloadfile(filename, param);
MATLAB's clustering tool works on multi-dimensional data sets, but only displays two dimensions. You then use the x and y axes to compare against, but I'm not quite sure whether I will be able to create a 2D clustering analysis from the current data.
What I need to do is normalize the matrix M from my previous post, FCM Clustering numeric data and csv/excel file.
To normalize the data I need:
the minimum and maximum of the dataset
the minimum and maximum of the normalized scale
a number from the data set
the normalized value
So the first question is: how do I find the minimum and maximum numbers in my dataset (M)?
Step 1:
Find the smallest and largest values in the data set and represent them with the capital letters A and B:
Let's say the minimum number is A = 92000
and the maximum number is B = 64525000
Step 2: normalize
Choose the smallest and largest values of the normalized scale and assign them to the lower-case variables a and b
(I'm unsure how to do this in MATLAB; I'm not sure how you normalize the data to start with.)
Set the minimum a = 1
Set the maximum b = 10
Step 3:
Calculate the normalized value of any number x using the equation:
A = 92000
B = 64525000
a = 1
b = 10
x = 2214000
a + (x - A)(b - a)/(B - A)
1 + (2214000 - 92000)(10 - 1)/(64525000 - 92000)
≈ 1.30
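In MATLAB, that min-max scaling can be written in a couple of lines (a sketch, assuming M is the all-numeric matrix built above and the whole dataset shares one scale):
%# min-max normalization sketch (M is assumed to be the numeric matrix from above)
a = 1; b = 10;   %# target scale
A = min(M(:));   %# smallest value in the dataset
B = max(M(:));   %# largest value in the dataset
Mnorm = a + (M - A)*(b - a)/(B - A);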

Looking at the errors in the middle of your question: a = load(matfile) returns a structure, and a structure is not supported by the ASCII MAT-file format. You need to pass the numeric matrix itself to save, not the structure returned by load; see the documentation for load and save.
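A sketch of what that could look like (the variable name M inside the MAT-file is an assumption; check what load actually returns):
S = load('matlab.mat');            % S is a struct whose fields are the saved variables
M = S.M;                           % extract the numeric matrix (field name assumed)
save('matlab.dat', 'M', '-ascii'); % every row now has the same number of columns
findcluster('matlab.dat')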

Related

How to store .csv data and calculate average value in MATLAB

Can someone help me understand how, in MATLAB, I can load a group of .csv files, select only the columns I am interested in, and produce as output a final file containing the average value and standard deviation of the y columns? I am not very good at MATLAB, so I would appreciate any help with this.
Here is what I have tried so far:
clear all;
clc;
which_column = 5;
dirstats = dir('*.csv');
col3Complete=0;
col4Complete=0;
for K = 1:length(dirstats)
    [num,txt,raw] = xlsread(dirstats(K).name);
    col3 = num(:,3);
    col4 = num(:,4);
    col3Complete = [col3Complete; col3];
    col4Complete = [col4Complete; col4];
    avgVal(K) = mean(col4(:));
end
col3Complete(1)=[];
col4Complete(1)=[];
%columnavg = mean(col4Complete);
%columnstd = std(col4Complete);
% xvals = 1 : size(columnavg,1);
% plot(xvals, columnavg, 'b-', xvals, columnavg-columnstd, 'r--', xvals, columnavg+columstd, 'r--');
B = reshape(col4Complete,[5000,K]);
m=mean(B,2);
C = reshape (col4Complete,[5000,K]);
S=std(C,0,2);
Now I know that I should compute the mean and standard deviation inside the for loop, using the mean() function, but I am not sure how to use it.
which_column = 5;
dirstats = dir('*.csv');
col3Complete=[]; % Initialise as empty matrix
col4Complete=[];
avgVal = zeros(length(dirstats),2); % initialise: column 1 for the mean, column 2 for the std
for K = 1:length(dirstats)
    [num,txt,raw] = xlsread(dirstats(K).name);
    col3 = num(:,3);
    col4 = num(:,4);
    col3Complete = [col3Complete; col3];
    col4Complete = [col4Complete; col4];
    avgVal(K,1) = mean(col4(:)); % 1st column contains mean
    avgVal(K,2) = std(col4(:));  % 2nd column contains standard deviation
end
%columnavg = mean(col4Complete);
%columnstd = std(col4Complete);
% xvals = 1 : size(columnavg,1);
% plot(xvals, columnavg, 'b-', xvals, columnavg-columnstd, 'r--', xvals, columnavg+columstd, 'r--');
B = reshape(col4Complete,[5000,K]);
meanVals=mean(B,2);
I didn't change much: I initialised your arrays as empty arrays so you do not have to delete the first entry later on, and made avgVal a two-column matrix with the mean in column 1 and the standard deviation in column 2. You can of course add two more columns if you want to collect those statistics for your 3rd csv column as well.
As a side note: xlsread is rather heavy for reading files, since Excel is horribly inefficient. If you want to read a structured file such as a csv, it's faster to use importdata.
Create some random matrix to store in a file with header:
A = rand(1e3,5);
out = fopen('output.csv','w');
fprintf(out,['ColumnA', '\t', 'ColumnB', '\t', 'ColumnC', '\t', 'ColumnD', '\t', 'ColumnE','\n']);
fclose(out);
dlmwrite('output.csv', A, 'delimiter','\t','-append');
Load it using csvread:
data = csvread('output.csv',1);
data now contains your five columns, without any headers.
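Or, as mentioned above, importdata can read the same file and handles the header row for you (a sketch based on the output.csv written above):
S = importdata('output.csv', '\t', 1); % tab delimiter, one header line
data = S.data;                         % the numeric block
headers = S.colheaders;                % column names from the first line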

Create a 2 column matrix with 2 different format types

I'm very new to MATLAB and I'm having trouble reading a binary file into a matrix. The problem is that I am trying to read the binary file into a two-column matrix (with hundreds of thousands of rows) where each column has a different format type.
I want column 1 to be in 'int8' format and column 2 to be a 'float'
This is my attempt so far:
FileID= fopen ('MyBinaryFile.example');
[A,count] = fread(FileID,[nrows, 2],['int8','float'])
This is not working: I get the error message 'Error using fread: Invalid Precision'.
I will then go on to plot the data once I have read it successfully.
This is probably a very easy problem for someone with MATLAB experience, but I haven't been able to find a solution online.
Thanks in advance to anyone who can help.
You should be aware that MATLAB cannot hold different data types in a single matrix (it can in a cell array, but that is another topic). So there is no point trying to read your mixed-type file into one matrix in one go ... it is not possible.
Unless you want a cell array, you will have to use two different variables for your two columns of different types. Once that is established, there are many ways to read such a file.
For the purpose of the example, I first had to create a binary file laid out as you described. This is done as follows:
%% // write example file
A = int8(-5:5) ; %// a few "int8" data
B = single(linspace(-3,1,11)) ; %// a few "float" (=single) data
fileID = fopen('testmixeddata.bin','w');
for il = 1:11
    fwrite(fileID,A(il),'int8');
    fwrite(fileID,B(il),'single');
end
fclose(fileID);
This creates a 2-column binary file: the first column holds 11 values of type int8 going from -5 to +5, and the second column holds 11 values of type float going from -3 to 1.
In each of the solutions below, the first column is read into a variable called C and the second column into a variable called D.
1) Read all data in one go - convert to proper type after
%% // Read all data in one go - convert to proper type after
fileID = fopen('testmixeddata.bin');
R = fread(fileID,'uint8=>uint8') ; %// read all values, most basic data type (unsigned 8 bit integer)
fclose(fileID);
R = reshape( R , 5 , [] ) ; %// reshape data into a matrix (5 is because 1+4byte=5 byte per column)
temp = R(1,:) ; %// extract data for first column into temporary variable (OPTIONAL)
C = typecast( temp , 'int8' ) ; %// convert into "int8"
temp = R(2:end,:) ; %// extract data for second column
D = typecast( temp(:) , 'single' ) ; %// convert into "single/float"
This is my favourite method, especially for speed: it minimises the read/seek operations on disk, and most of the work is done in memory (much, much faster than disk operations).
Note that the temporary variable is there only for clarity; you can avoid it altogether if you get your indexing into the raw data right.
The key thing to understand is the use of the typecast function. And the good news is that it has got even faster since R2014b.
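To illustrate what typecast does (it reinterprets the raw bytes without converting anything; the example below assumes a little-endian machine):
bytes = uint8([0 0 128 63]);   %// the 4 bytes of the float 1.0, little endian
x = typecast(bytes, 'single')  %// x = 1, no value conversion involved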
2) Read column by column (using "skipvalue") - 2 pass approach
%% // Read column by column (using "skipvalue") - 2 pass approach
col1size = 1 ; %// size of data in column 1 (in [byte])
col2size = 4 ; %// size of data in column 2 (in [byte])
fileID = fopen('testmixeddata.bin');
C = fread(fileID,'int8=>int8',col2size) ; %// read all "int8" values, skipping all "float"
fseek(fileID,col1size,'bof') ; %// rewind to beginning of column 2 at the top of the file
D = fread(fileID,'single=>single',col1size) ; %// read all "float" values, skipping all "int8"
fclose(fileID);
That works too, but it is probably much slower than the method above. It may be clearer code for someone else to read, but I find it ugly (and yet I used this approach for several years until I switched to the method above).
3) Read element by element
%% // Read element by element (slow - not recommended)
fileID = fopen('testmixeddata.bin');
C = []; D = [];
while ~feof(fileID)
    try
        C(end+1) = fread(fileID,1,'int8=>int8') ;
        D(end+1) = fread(fileID,1,'single=>single') ;
    catch
        disp('reached End Of File')
    end
end
fclose(fileID);
Talking about ugly code ... that works too, and if you were writing C it would be more than OK. But in MATLAB ... please avoid it! (Well, your choice ultimately.)
Merging in one variable
If you really want all of that in one single variable, it could be a structure or a cell array. For a cell array (to keep matrix-style indexing), simply use:
%% // Merge into one "cell array"
Data = { C , D } ;
Data =
[11x1 int8] [11x1 single]
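A structure works just as well if you prefer named fields (the field names below are only for illustration):
%% // Merge into one "struct"
Data = struct('ints', C, 'floats', D); %// Data.ints is int8, Data.floats is single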

Reading data from a Text File into Matlab array

I am having difficulty in reading data from a .txt file using Matlab.
I have to create a 200x128 dimension array in Matlab, using the data from the .txt file. This is a repetitive task, and needs automation.
Each row of the .txt file is a complex number of the form a+ib, written as a[space]b. A sample of my text file:
(0)
1.2 2.32222
2.12 3.113
.
.
.
3.2 2.22
(1)
4.4 3.4444
2.33 2.11
2.3 33.3
.
.
.
(2)
.
.
(3)
.
.
(199)
.
.
The row index (X) appears in the .txt file surrounded by parentheses. My final matrix should be of size 200x128: after each (X) there are exactly 128 complex numbers.
Here is what I would do. First, delete the "(0)"-style lines from your text file (you could even use a simple shell script for that); I put the result into a file called post2.txt.
% First, load the text file into Matlab:
A = load('post2.txt');
% Create the complex numbers based on the two columns of data:
vals = A(:,1) + 1i*A(:,2);
% Then reshape the column of complex numbers into a matrix.
% The values arrive block by block, and reshape fills column-wise,
% so build a 128x200 matrix and transpose it to get 200x128:
mat = reshape(vals, [128, 200]).';
mat will then be a 200x128 matrix of complex data. At this point you can obviously put a loop around this to process multiple files, as sketched below.
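A minimal sketch of such a loop (the file-name pattern post*.txt is just an assumption):
files = dir('post*.txt');                  % hypothetical naming of the cleaned files
mats = cell(numel(files), 1);
for k = 1:numel(files)
    A = load(files(k).name);
    vals = A(:,1) + 1i*A(:,2);
    mats{k} = reshape(vals, [128, 200]).'; % 200x128, one row per "(X)" block
end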
Hope that helps.
You can read the data in using the following function:
function data = readData(aFilename, m, n)
% if no parameters were passed, use these as defaults:
if ~exist('aFilename', 'var')
    m = 128;
    n = 200;
    aFilename = 'post.txt';
end
% init some stuff:
data = nan(n, m);
formatStr = repmat('%f', 1, 2*m);
% Read in the Data:
fid = fopen(aFilename);
for ind = 1:n
    lineID = fgetl(fid);               % skip the "(X)" header line
    dataLine = fscanf(fid, formatStr); % read the 2*m numbers that follow
    dataLineComplex = dataLine(1:2:end) + dataLine(2:2:end)*1i;
    data(ind, :) = dataLineComplex;
end
fclose(fid);
(edit) This function can be improved by including the "(X)" header lines in the format string and throwing them away:
function data = readData(aFilename, m, n)
% if no parameters were passed, use these as defaults:
if ~exist('aFilename', 'var')
    m = 128;
    n = 200;
    aFilename = 'post.txt';
end
% init format stuff:
formatStr = ['(%*d)\n' repmat('%f%f\n', 1, m)];
% Read in the Data:
fid = fopen(aFilename);
data = fscanf(fid, formatStr);
fclose(fid);
data = data(1:2:end) + data(2:2:end)*1i;
% the values arrive block by block, so build m-by-n and transpose to get n-by-m:
data = reshape(data, m, n).';

How to import a sequence of Excel Files in matlab as a column vectors or as a cell array?

I want to import a sequence of Excel files, each containing a large amount of data. The problem is that I want to process the data from each file in turn and store the output in a variable, but each time I process a different file the variable gets overwritten in the workspace. Is there any way I can store these files and process each one in turn?
numFiles = 1;
range = 'A2:Q21';
sheet = 1;
myData = cell(1,numFiles); % Importing data from Excel
for fileNum = 1:numFiles
    fileName = sprintf('myfile%02d.xlsx',fileNum);
    myData{fileNum} = importfile3(fileName,sheet,range);
end
data = cell2mat(myData);
The actual data import is performed by importfile3 which is, for the most part, a wrapper for the xlsread function that returns a matrix corresponding to the specified range of excel data.
function data = importfile3(workbookFile, sheetName, range)
% If no sheet is specified, read first sheet
if nargin == 1 || isempty(sheetName)
    sheetName = 1;
end
% If no range is specified, read all data
if nargin <= 2 || isempty(range)
    range = '';
end
%% Import the data
[~, ~, raw] = xlsread(workbookFile, sheetName, range);
%% Replace non-numeric cells with 0.0
R = cellfun(@(x) ~isnumeric(x) || isnan(x), raw); % Find non-numeric cells
raw(R) = {0.0}; % Replace non-numeric cells
%% Create output variable
data = cell2mat(raw);
The issue that you are running into is a result of cell2mat concatenating all of the data in your cells into one large 2-dimensional matrix. If you were to import two Excel files with 20 rows and 17 columns each, this would result in a 2-dimensional matrix of size 20 x 34. The doc for cell2mat has a nice visual describing this.
I see that your importfile3 function returns a matrix, and based on your use of cell2mat in your final line of code, it looks like you would like to have your final result be in the form of a matrix. So I think the easiest way to go about this is to just bypass the intermediate myData cell array.
In the example code below, the resulting data is a 3-dimensional matrix. The 1st dimension indicates row number, 2nd dimension is column number, and 3rd dimension is file number. Cell arrays are very useful for "jagged" data, but based on the code you provided, each excel data set that you import will have the same number of rows and columns.
numFiles = 2;
range = 'A2:Q21';
sheet = 1;
% Number of rows and cols known before data import
numRows = 20;
numCols = 17;
data = zeros(numRows,numCols,numFiles);
for fileNum = 1:numFiles
    fileName = sprintf('myfile%02d.xlsx',fileNum);
    data(:,:,fileNum) = importfile3(fileName,sheet,range);
end
Accessing this data is now very straightforward.
data(:,:,1) returns the data imported from your first excel file.
data(:,:,2) returns the data imported from your second excel file.
etc.
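And because everything sits in one numeric array, per-file statistics are a single call away (a sketch, assuming the 20x17xN layout above):
colMeans = squeeze(mean(data, 1)); % 17 x numFiles: column means for each file
grandMean = mean(data(:));         % mean over all files at once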

How to efficiently find correlation and discard points outside 3-sigma range in MATLAB?

I have a data file m.txt that looks something like this (with a lot more points):
286.842995
3.444398
3.707202
338.227797
3.597597
283.740414
3.514729
3.512116
3.744235
3.365461
3.384880
Some of the values (like 338.227797) are very different from the values I generally expect (smaller numbers).
So I am thinking that I will remove all the points that lie outside the 3-sigma range. How can I do that in MATLAB?
Also, the bigger problem is that this file has a separate file t.txt associated with it, which stores the corresponding time values for these numbers. So I'll have to remove the corresponding time values from the t.txt file as well.
I am still learning MATLAB, and I know there must be a better way of doing this than storing the indices of the elements removed from m.txt and then removing those elements from t.txt.
@Amro is close, but the find is unnecessary (look up logical indexing) and you need to include the mean for a true +/- 3 sigma range. I would go with the following:
%# load files
m = load('m.txt');
t = load('t.txt');
%# find values within range
z = 3;
meanM = mean(m);
sigmaM = std(m);
I = abs(m - meanM) <= z * sigmaM;
%# keep values within range
m = m(I);
t = t(I);
%# load files
m = load('m.txt');
t = load('t.txt');
%# find outliers indices
z = 3;
idx = find( abs(m-mean(m)) > z*std(m) );
%# remove them from both data and time values
m(idx) = [];
t(idx) = [];
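If you also need the correlation mentioned in the title, corrcoef on the cleaned vectors gives it (a sketch, assuming m and t after the filtering above):
%# correlation between time and measurement after outlier removal
R = corrcoef(t, m); %# 2x2 matrix; R(1,2) is the correlation coefficient
r = R(1,2);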