I know this kind of thing has been asked before but nothing I have tried has helped.
I've borrowed a function from GitHub for a project, but it's in MATLAB and my project is in Python, so I want to save the function's output as a CSV that I can read back into Python.
I've only been using MATLAB for about four hours, so this could be a stupid question...
I've tried a number of different methods of saving to a CSV, but none of them have worked. They don't throw errors; the files just never show up.
This is my latest attempt:
clear;clc;
fileName = 'AlzScSK.csv';
csvData = importdata(fileName);
rawData = csvData.data;
dataMV = rawData;
scaledMV = dataMV
K = 6;
function [dataImputed dataImputedWeighted] = NSkNNDataHM(dataMV)
% Function to impute missing values in a dataset using NSkNN. scaledMV is
% the autoscale dataset with the missing values and K is the # of nearest
% neighbors to use to impute the data. filteredMV is the missing value
% dataset after filtering, but before autoscaling.
% NSkNNData_HM does not skip neighbors that have NaN values in the same
% location as the metabolite being imputed. Instead, it replaces these NaN
% values with the half minimum value of that metabolite.
numCol = size(scaledMV,2);
for col = 1:numCol
rowMV{col} = find(isnan(scaledMV(:,col))); % Finds the row # of every missing value in each column
end
counter = 1;
for targetCol = 1:numCol % i is the target sample
for neighborCol = 1:numCol % Calculate the Euclidean distances between the target sample (i) and the other samples (j)
MVRowsRemoved = scaledMV;
rowsToRemove = union(rowMV{targetCol},rowMV{neighborCol}); % Ignore NaNs when calculating distances
MVRowsRemoved(rowsToRemove,:) = []; % Remove rows in target sample that have missing values
numMetInCalc = size(MVRowsRemoved,1); % # of metabolites used in calculation of metabolites in order to weight distance
% Divide by numMetInCalc to avoid scenarios where distances
% calculated with only a few metabolites are weighted more heavily
% over distances that are close to the same distance, but used more
% metabolites in the calculation.
distance = pdist2(MVRowsRemoved(:,targetCol)',MVRowsRemoved(:,neighborCol)')/sqrt(numMetInCalc);
%distance = pdist2(MVRowsRemoved(:,targetCol)',MVRowsRemoved(:,neighborCol)');
distIdx(counter,:) = [targetCol distance neighborCol];
counter = counter+1;
end
end
% Remove rows that calculated the Euclidean distance between a sample and
% itself.
sameSample = find(distIdx(:,1)==distIdx(:,3));
distIdx(sameSample,:) = [];
distIdxSorted = sortrows(distIdx);
minValperRow = min(scaledMV,[],2); % Finds minimum of each metabolite
% Implement NSkNN with half minimum replacement
dataImputed = scaledMV;
dataImputedWeighted = scaledMV;
for targetCol = 1:numCol
numMV = size(rowMV{targetCol},1); % # of missing values in a column
firstNNIdx = (targetCol-1)*(numCol-1)+1; % Column index of the first nearest neighbor of target sample (i) in distIdxSorted
for MVidx = 1:numMV % For each missing value in the target sample...
tempDataMV = scaledMV;
NN = distIdxSorted(firstNNIdx:firstNNIdx+K-1,3); % Column #s of the k nearest neighbors
DistanceNN = distIdxSorted(firstNNIdx:firstNNIdx+K-1,2); % Distances of k nearest neighbors
idxNaNinCol = find(isnan(tempDataMV(rowMV{targetCol}(MVidx),NN))); % Finds missing values that are the same metabolite as the target metabolite to be imputed
if isempty(idxNaNinCol)~=1 % If NaN values found...
% If there are NaN values in the nearest neighbor
% metabolite that is the same as the target metabolite to
% be imputed, replace with half min value of the target
% metabolite
tempDataMV(rowMV{targetCol}(MVidx),NN(idxNaNinCol)) = (minValperRow(rowMV{targetCol}(MVidx)) - avgMV(rowMV{targetCol}(MVidx))/stddevMV(rowMV{targetCol}(MVidx)))/2;
end
% Imputed data is weighted by the inverse of the distance
WeightMultiplier = (1./DistanceNN')/sum(1./DistanceNN);
dataImputedWeighted(rowMV{targetCol}(MVidx),targetCol) = sum(tempDataMV(rowMV{targetCol}(MVidx),NN).*WeightMultiplier);
% Not weighted
dataImputed(rowMV{targetCol}(MVidx),targetCol) = mean(tempDataMV(rowMV{targetCol}(MVidx),NN));
writematrix(dataImputed,['M.csv'])
end
end
end
I'm not sure whether I'm referring to the wrong variable or whether I've put the writematrix(dataImputed,['M.csv']) call in the wrong place.
I've also tried these sorts of routes:
% Convert cell to a table and use first row as variable names
T = cell2table(c(2:end,:),'VariableNames',c(1,:))
% Write the table to a CSV file
writetable(T,'myDataFile.csv')
dlmwrite(filename,M)
csvwrite('filename.csv',variable2,0,2)
This is currently what my workspace looks like (workspace screenshot omitted).
Try this: writematrix(dataImputed,'M.csv','WriteMode','append')
If it's still not working, sharing the AlzScSK.csv with us would be helpful.
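One more cause worth checking first: the script above defines NSkNNDataHM but never calls it, and defining a function does not run it, so the writematrix line inside it never executes. The function body also refers to scaledMV, K, avgMV, and stddevMV, which a function cannot see in the script's workspace; they would need to be passed in or computed inside. A minimal sketch of the calling pattern, assuming the function has been fixed to work from its inputs and saved as NSkNNDataHM.m (the output file names are illustrative):
% Load the data, run the imputation, then save the outputs.
csvData = importdata('AlzScSK.csv');
scaledMV = csvData.data; % data with NaNs to impute
[dataImputed, dataImputedWeighted] = NSkNNDataHM(scaledMV);
writematrix(dataImputed, 'M.csv'); % unweighted NSkNN result
writematrix(dataImputedWeighted, 'M_weighted.csv'); % distance-weighted result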
Related
I am dealing with a k-nearest-neighbour problem in MATLAB. There is an image with r rows and c columns, and I divide it into r*c blocks, each block representing a patch centred on one pixel.
I want to find the k nearest neighbours of each block within a specific search range. At first I used knnsearch with a kd-tree:
ns = createns(Block','nsmethod','kdtree');
[Index_nearest,dist] = knnsearch(ns,Block','k',k+1);
However, I find that it searches for the k nearest neighbours among all blocks, instead of within the specified range. Is there any other method to achieve this goal? Could anyone give me some hints? Thanks in advance!
Edit: the code for knnsearch:
function [Index_nearest, Weight] = Compute_Weight(Input, Options)
% Input the data and pre-processing
windowsize = Options.winsize;
k = Options.directionsize;
deviation = Options.deviation; % Deviation for Gaussian kernel
h = Options.h; % This parameter is for controlling the weights
[r,c] = size(Input);
In_pad = padarray(Input, [windowsize windowsize], 'symmetric');
window_size = (2*windowsize+1)*(2*windowsize+1);
Block = zeros(window_size,r*c);
%% Split the input data into blocks
for i = 1:r
for j = 1:c
block = In_pad(i:i+2*windowsize,j:j+2*windowsize);
Block(:,(j-1)*r+i) = block(:); % column index = sub2ind([r c],i,j); the original r*(i-1)+j mis-indexes when r ~= c
end
end
%% Find k-nearest neighbour blocks
% Create a KDtree with all local patches
ns = createns(Block','nsmethod','kdtree');
% Find the patches closest in intensity to each local patch
[Index_nearest,ddd] = knnsearch(ns,Block','k',k+1);
Index_nearest = Index_nearest';
Index_nearest = Index_nearest(2:k+1,:);
end
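One way to get the restriction is to skip the kd-tree and, for each pixel, compute distances only to the blocks whose centres fall inside the search window. A minimal sketch (not tested against the data above; it reuses r, c, k, and Block from the function, and searchRange is a hypothetical parameter):
searchRange = 10; % hypothetical search radius in pixels
[iGrid, jGrid] = ndgrid(1:r, 1:c); % coordinates of every block centre
Index_nearest = zeros(k, r*c);
for i = 1:r
for j = 1:c
q = sub2ind([r c], i, j); % column of the query block
inWin = abs(iGrid(:)-i) <= searchRange & abs(jGrid(:)-j) <= searchRange;
cand = find(inWin); % candidate blocks inside the window
d = sum(bsxfun(@minus, Block(:,cand), Block(:,q)).^2, 1); % squared distances
[~, order] = sort(d); % assumes numel(cand) > k
Index_nearest(:,q) = cand(order(2:k+1)); % drop the query block itself (unique zero-distance match assumed)
end
end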
How do I automatically select the column of a matrix whose values, over a given subset of elements, are closest to those of a predefined goal vector over the same subset?
I solved the problem and tested the method on a 100x10 matrix; it works, and should also work for larger matrices while hopefully not becoming too computationally expensive.
%% Selection of optimal source function
% Now need to select the best source function (column k = 1,2,...,n) in the
% data matrix: the one whose scalar values, over a random subset of elements,
% are closest to a pre-defined goal vector over the same subset
% Proposed Method:
% Project the columns of the data matrix onto the goal vector
% Calculate the projection error vector matrix; the null space of the
% local goal vector is orthogonal to its row space
% The column holding the minimum error vector is the optimal column
% [1] find the null space of the goal vector, containing the projection
% errors
mpg = pinv(gloc);
xstar = mpg*A;
p = gloc*xstar;
nA = A-p;
% [2] the minimum error vector will correspond to the optimal source
% function
normnA = zeros(1,n);
for i = 1:n
normnA(i) = norm(nA(:,i));
end
minnA = min(normnA);
[row,k] = find(normnA == minnA);
disp('The optimal source function is: ')
disp(k)
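As a quick self-contained check of the method on random data (everything here except A, gloc, and n is an assumed name used only for illustration, including how the random subset is drawn):
m = 100; n = 10; % same size as the test mentioned above
A = rand(m, n); % matrix of candidate source functions
goal = rand(m, 1); % pre-defined goal vector
subset = randperm(m, 20); % hypothetical random subset of elements
gloc = goal(subset); % goal restricted to the subset
Aloc = A(subset, :); % columns restricted to the same subset
xstar = pinv(gloc) * Aloc; % least-squares projection coefficients
nA = Aloc - gloc * xstar; % projection error vectors
normnA = sqrt(sum(nA.^2, 1)); % error norm per column
[~, k] = min(normnA); % index of the optimal source function
disp(k)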
I want to use 10-fold cross-validation to test which polynomial form (first, second, or third order) gives the best fit. I want to divide my data set into 10 subsets, remove one subset from the 10, derive a regression model without this subset, predict the output values for this subset using the derived model, and compute the residuals. Finally, I repeat this calculation for each subset and sum the squares of the resulting residuals.
I have already coded the following in MATLAB 2013b, which samples the data and fits the regression on the training data. I am stuck on how to repeat this for every subset and how to compare which polynomial form gives the best fit.
% Sample the data
parm = [AT];
n = length(parm);
k = 10; % how many parts to use
allix = randperm(n); % all data indices, randomly ordered
numineach = ceil(n/k); % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);
for p=1:k
testix = allix(p,:); % indices to use for testing
testix(isnan(testix)) = []; % remove NaNs if necessary
trainix = setdiff(1:n,testix); % indices to use for training
%train = parm(trainix); %gives the training data
%test = parm(testix); %gives the testing data
end
% Derive regression on the training data
Sal = Salinity(trainix);
Temp = Temperature(trainix);
At = parm(trainix);
xyz =[Sal Temp At];
% Fit a Polynomial Surface
surffit = fit([xyz(:,1), xyz(:,2)],xyz(:,3), 'poly11');
% Shows equation, rsquare, rmse
[b,bint,r] = fit([xyz(:,1), xyz(:,2)],xyz(:,3), 'poly11');
Regarding executing your code for every subset, you can put the fit inside the loop and store the results, e.g.
% Sample the data
parm = [AT];
n = length(parm);
k = 10; % how many parts to use
allix = randperm(n); % all data indices, randomly ordered
numineach = ceil(n/k); % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);
bAll = []; rmseAll = zeros(1,k); % collect coefficients and test statistics per fold
for p=1:k
testix = allix(p,:); % indices to use for testing
testix(isnan(testix)) = []; % remove NaNs if necessary
trainix = setdiff(1:n,testix); % indices to use for training
%train = parm(trainix); %gives the training data
%test = parm(testix); %gives the testing data
% Derive regression on the training data
Sal = Salinity(trainix);
Temp = Temperature(trainix);
At = parm(trainix);
xyz =[Sal Temp At];
% Fit a polynomial surface; the second output gof holds rsquare and rmse
[surffit, gof] = fit([xyz(:,1), xyz(:,2)], xyz(:,3), 'poly11');
bAll = [bAll; coeffvalues(surffit)]; rmseAll(p) = gof.rmse;
end
Regarding the best fit, you can probably pick the model with the lowest RMSE; a sketch of that comparison across polynomial orders follows.
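A rough sketch (assuming Salinity, Temperature, and AT are column vectors as above, and that 'poly22' and 'poly33' are acceptable stand-ins for the second- and third-order surfaces): for each candidate model, sum the squared test-set residuals over the folds and keep the smallest total.
models = {'poly11', 'poly22', 'poly33'}; % 1st-, 2nd-, 3rd-order surfaces
press = zeros(1, numel(models)); % cross-validated sum of squared residuals
for m = 1:numel(models)
for p = 1:k
testix = allix(p,:); testix(isnan(testix)) = [];
trainix = setdiff(1:n, testix);
surffit = fit([Salinity(trainix), Temperature(trainix)], AT(trainix), models{m});
resid = AT(testix) - surffit(Salinity(testix), Temperature(testix)); % test-set residuals
press(m) = press(m) + sum(resid.^2);
end
end
[~, best] = min(press);
fprintf('Best model by CV residuals: %s\n', models{best})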
So I'm writing a k-means script in MATLAB, since the native function doesn't seem to be very efficient, and it seems to be fully operational. It works on the small training set I'm using (a 150x2 matrix fed in via text file). However, the runtime grows dramatically for my target data set, which is a 3924x19 matrix.
I'm not the greatest at vectorization, so any suggestions would be greatly appreciated. Here's my k-means script so far (I know I'll need to adjust my convergence condition, since it currently looks for an exact match, and I'll probably need more iterations for a dataset this large, but I want it to finish in a reasonable time before I crank that number up):
clear all;
%take input file (manually specified by user)
disp('Please type input filename (in working directory): ')
target_file = input('filename: ', 's');
%parse and load into matrix
data = load(target_file);
%prompt name of output file (for later) UNCOMMENT BELOW TWO LINES LATER
% disp('Please type output filename (to be saved in working directory): ')
% output_name = input('filename:', 's')
%prompt number of clusters
disp('Please type desired number of clusters: ')
c = input ('number of clusters: ');
%specify type of kmeans algorithm ('regular' for regular, 'fuzzy' for fuzzy)
%UNCOMMENT BELOW TWO LINES LATER
% disp('Please specify type (regular or fuzzy):')
% runtype = input('type: ', 's')
%initialize cluster centroid locations within bounds given by data set
%initialize rangemax and rangemin row vectors
%with length same as number of dimensions
rangemax = zeros(1,size(data,2));
rangemin = zeros(1,size(data,2));
%map max and min values for bounds
for dim = 1:size(data,2)
rangemax(dim) = max(data(:,dim));
rangemin(dim) = min(data(:,dim));
end
% rangemax
% rangemin
%randomly initialize mu_k (center) locations in (k x n) matrix where k is
%cluster number and n is number of dimensions/coordinates
mu_k = zeros(c,size(data,2));
for k = 1:c % one row per cluster (looping to size(data,2) leaves rows of zeros when c > size(data,2))
mu_k(k,:) = rangemin + (rangemax - rangemin).*rand(1,size(data,2)); % independent random value per dimension, not one shared scalar
end
mu_k
%iterate k-means
%initialize holding variable for distance comparison
comparisonmatrix = [];
%initialize assignment vector
assignment = zeros(size(data,1),1);
%initialize distance holding vector
dist = zeros(1,c); % one distance per cluster center, not per data column
%specify convergence threshold
%threshold = 0.001;
for iteration = 1:25
%save current assignment values to check convergence condition
hold_assignment = assignment;
for point = 1:size(data,1)
%calculate distances from point to centers
for k = 1:c
%holding variables
comparisonmatrix = [data(point,:);mu_k(k,:)];
dist(k) = pdist(comparisonmatrix);
end
%record location of minimum distance (location value will be between 1
%and c)
[minval, location] = min(dist);
%assign cluster number (analogous to location value)
assignment(point) = location;
end
%check convergence criteria
if isequal(assignment,hold_assignment)
break
end
%revise mu_k locations
%count number of each label
assignment_count = zeros(1,c);
for i = 1:size(data,1)
assignment_count(assignment(i)) = assignment_count(assignment(i)) + 1;
end
%compute centroids
point_total = zeros(size(mu_k));
for row = 1:size(data,1)
point_total(assignment(row),:) = point_total(assignment(row),:) + data(row,:);
end
%move mu_k values to centroids
for center = 1:c
mu_k(center,:) = point_total(center,:)/assignment_count(center);
end
end
There are a lot of loops in there, so I feel that there's a lot of optimization to be made. However, I think I've just been staring at this code for far too long, so some fresh eyes could help. Please let me know if I need to clarify anything in the code block.
When the above code block is executed (in context) on the large dataset, it takes 3732.152 seconds, according to MATLAB's profiler, to make the full 25 iterations (I'm assuming it hasn't "converged" according to my criteria yet) for 150 clusters, but about 130 of them return NaNs (130 rows in mu_k).
Profiling will help, but the place to rework your code is to avoid the loop over the number of data points (for point = 1:size(data,1)). Vectorize that.
In your for iteration loop here is a quick partial example,
[nPoints,nDims] = size(data);
% Calculate all high-dimensional distances at once
kdiffs = bsxfun(@minus,data,permute(mu_k,[3 2 1])); % NxDx1 - 1xDxK => NxDxK
distances = sum(kdiffs.^2,2); % no need to do sqrt
distances = squeeze(distances); % Nx1xK => NxK
% Find closest cluster center for each point
[~,ik] = min(distances,[],2); % Nx1
% Calculate the new cluster centers (mean the data)
mu_k_new = zeros(c,nDims);
clustersizes = zeros(1,c);
for i = 1:c
indk = ik==i;
clustersizes(i) = nnz(indk);
if clustersizes(i) > 0 % an empty cluster would otherwise produce a NaN row
mu_k_new(i,:) = mean(data(indk,:),1); % explicit dim keeps single-point clusters correct
end
end
This isn't the only (or the best) way to do it, but it should be a decent example.
Some other comments:
Instead of using input, make this script into a function to efficiently handle input arguments.
If you want an easy way to specify a file, see uigetfile.
With many MATLAB functions, such as max, min, sum, mean, etc., you can specify a dimension over which the function should operate. This way you can run it on a matrix and compute values for multiple conditions/dimensions at the same time.
Once you get decent performance, consider iterating longer, specifically until the centers no longer change or the number of samples that change clusters becomes small.
The cluster with the smallest distance for each point, ik, will be the same with squared Euclidean distance.
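Picking up the dimension and squared-distance points above, and since the post already uses pdist (Statistics Toolbox), pdist2 collapses the whole assignment step into two lines; a tolerance-based stopping test is sketched as well (mu_k_new as computed in the snippet above):
% Inside the iteration loop: assign each point to its nearest centre in one
% call; squared Euclidean suffices for the argmin, as noted above.
D = pdist2(data, mu_k, 'squaredeuclidean'); % N x c distance matrix
[~, assignment] = min(D, [], 2); % nearest centre per point
% ... recompute mu_k_new from the assignments as in the snippet above ...
% Stop once the centres have effectively stopped moving.
if max(abs(mu_k_new(:) - mu_k(:))) < 1e-6
break
end
mu_k = mu_k_new;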
I have a matrix (X) of doubles containing time series. Some of the observations are set to NaN when there is a missing value. I want to calculate the standard deviation per column to get a std dev value for each column. Since I have NaNs mixed in, a simple std(X) will not work, and if I try std(X(~isnan(X))) I end up getting a single std dev for the entire matrix instead of one per column.
Is there a way to simply omit the NaNs from std dev calculations along the 1st dim without resorting to looping?
Please note that I only want to ignore individual values as opposed to entire rows or cols in case of NaNs. Obviously I cannot set NaNs to zero or any other value as that would impact calculations.
Have a look at nanstd (Statistics Toolbox).
The idea is to center the data using nanmean, then to replace NaN with zero, and finally to compute the standard deviation.
See nanmean below.
% maximum admissible fraction of missing values
max_miss = 0.6;
[m,n] = size(x);
% replace NaNs with zeros.
inan = find(isnan(x));
x(inan) = zeros(size(inan));
% determine number of available observations on each variable
[i,j] = ind2sub([m,n], inan); % subscripts of missing entries
nans = sparse(i,j,1,m,n); % indicator matrix for missing values
nobs = m - sum(nans);
% set nobs to NaN when there are too few entries to form robust average
minobs = m * (1 - max_miss);
k = find(nobs < minobs);
nobs(k) = NaN;
mx = sum(x) ./ nobs;
See nanstd below.
flag = 1; % default: normalize by nobs-1
% center data
xc = x - repmat(mx, m, 1);
% replace NaNs with zeros in centered data matrix
xc(inan) = zeros(size(inan));
% standard deviation
sx = sqrt(sum(conj(xc).*xc) ./ (nobs-flag));
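For what it's worth, in newer MATLAB releases (R2015a and later) the same per-column behaviour is available without the Statistics Toolbox via the 'omitnan' flag:
sx = std(X, 0, 1, 'omitnan'); % std of each column, ignoring NaN entries
mx = mean(X, 1, 'omitnan'); % matching per-column mean, if needed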