MATLAB beginner: median, mode and binning

I am a beginner with MATLAB and I am struggling with this assignment. Can anyone guide me through it?
Consider the data given below:
x = [ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , ...
34, 47, 455, 21, , 22, 100 ];
Once the data is loaded, see if you can find any:
Outliers or
Missing data in the data file
Correct the missing values using median, mode and noisy data using median binning, mean binning and bin boundaries.

This isn't so bad. First off, take a look at the distribution of your data. The majority of the values have two digits, so the outliers are those with a single digit, or those that are much larger than two digits. Mind you, this is totally subjective, so someone else may tell you that the single digits are part of your data too. The missing data are the empty slots between consecutive commas. If you copy and paste this statement directly into MATLAB, it gives a syntax error, because every element has to be present when you define an array explicitly this way. So let's write some MATLAB code that changes the empty slots to NaN (not-a-number).
To do this, store the statement as a string first, then use regexprep to insert a NaN wherever the string has a comma, a space, then another comma. Finally, use eval to convert the string into an actual MATLAB statement:
x = '[ 1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100 ];'
y = eval(regexprep(x, ', ,', ', NaN, '));
If we display this data, we get:
y =
Columns 1 through 6
1 48 81 2 10 25
Columns 7 through 12
NaN 14 18 53 41 56
Columns 13 through 18
89 0 1000 NaN 34 47
Columns 19 through 23
455 21 NaN 22 100
As such, to answer our first question, any values that are missing are denoted as NaN and those numbers that are bigger than double digits are outliers.
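If you would rather not rely on eval or on eyeballing the digits, here is a minimal sketch of both steps. It assumes the data arrives as a comma-separated string with the brackets removed. The cutoff of 3 scaled median absolute deviations is just one common convention that I picked for illustration, not something the assignment specifies. The names `s`, `tokens`, `madScaled` and `isOut` are mine.

```matlab
% Comma-separated data from the question, brackets removed (assumed input form)
s = '1 , 48 , 81 , 2 , 10 , 25 , ,14 , 18 , 53 , 41, 56, 89,0, 1000, , 34, 47, 455, 21, , 22, 100';

% Split on commas; blank tokens cannot be parsed, so str2double returns NaN
tokens = strsplit(s, ',', 'CollapseDelimiters', false);
y = str2double(tokens);

% Flag outliers: more than 3 scaled median absolute deviations from the median
v = y(~isnan(y));                         % ignore the missing entries
med = median(v);
madScaled = 1.4826 * median(abs(v - med));
isOut = abs(y - med) > 3 * madScaled;     % comparisons with NaN are false

missingIdx = find(isnan(y));              % positions of the missing values
outliers = y(isOut);                      % values flagged by the rule
```

With this rule only 1000 and 455 are flagged and the single-digit values stay in, which shows how much the answer depends on the rule you choose.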
For the next question, we extract the values that are not missing, calculate their mean and median, and fill in the NaN values with each of them in turn. For the bin boundaries, this is the same thing as filling each missing value with the value to its left (or right, depending on your definition, but let's use left). As such:
yMissing = isnan(y); %// Which values are missing?
y_noNaN = y(~yMissing); %// Extract the non-missing values
meanY = mean(y_noNaN); %// Get the mean
medianY = median(y_noNaN); %// Get the median
%// Output - Fill in missing values with median
yMedian = y;
yMedian(yMissing) = medianY;
%// Same for mean
yMean = y;
yMean(yMissing) = meanY;
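%// Bin boundaries - copy the value to the left of each missing entry
%// (assumes the first value is not missing and no two NaNs are adjacent)
yBinBound = y;
yBinBound(yMissing) = y(find(yMissing)-1);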
The mean and median of the non-missing values are:
meanY =
105.8500
medianY =
37.5000
The outputs for each of these, along with the original data with the missing values, look like:
format bank; %// Do this to show just the first two decimal places for compact output
format compact;
y =
Columns 1 through 5
1 48 81 2 10
Columns 6 through 10
25 NaN 14 18 53
Columns 11 through 15
41 56 89 0 1000
Columns 16 through 20
NaN 34 47 455 21
Columns 21 through 23
NaN 22 100
yMean =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 105.85 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
105.85 34.00 47.00 455.00 21.00
Columns 21 through 23
105.85 22.00 100.00
yMedian =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 37.50 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
37.50 34.00 47.00 455.00 21.00
Columns 21 through 23
37.50 22.00 100.00
yBinBound =
Columns 1 through 5
1.00 48.00 81.00 2.00 10.00
Columns 6 through 10
25.00 25.00 14.00 18.00 53.00
Columns 11 through 15
41.00 56.00 89.00 0 1000.00
Columns 16 through 20
1000.00 34.00 47.00 455.00 21.00
Columns 21 through 23
21.00 22.00 100.00
If you take a look at each of the output values, this fills in our data with the mean, median and also the bin boundaries as per the question.
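The assignment also mentions the mode and smoothing noisy data by binning, which the code above does not cover. Below is a rough sketch of the usual textbook reading: sort the values, split them into equal-depth bins, then replace each value with its bin mean, its bin median, or the nearer bin boundary. The bin depth of 5 is my assumption, as are the variable names. Note that every value in this data set is unique, so mode simply returns the smallest one.

```matlab
% Non-missing values from the question
v = [1 48 81 2 10 25 14 18 53 41 56 89 0 1000 34 47 455 21 22 100];

% Mode fill value: all values occur once, so mode returns the smallest (0)
modeV = mode(v);

% Equal-depth binning: sort, then put binSize values in each bin
binSize = 5;                              % assumed depth; 20 values -> 4 bins
bins = reshape(sort(v), binSize, []);     % each column is one bin

% Smoothing by bin means and bin medians
binMeans = mean(bins, 1);
binMedians = median(bins, 1);
byMean = repmat(binMeans, binSize, 1);
byMedian = repmat(binMedians, binSize, 1);

% Smoothing by bin boundaries: snap each value to the closer of min/max
lo = repmat(bins(1, :), binSize, 1);
hi = repmat(bins(end, :), binSize, 1);
byBound = lo;
closerToHi = (hi - bins) < (bins - lo);
byBound(closerToHi) = hi(closerToHi);
```

The smoothed values come out in sorted order. Keep the second output of sort if you need to map them back to their original positions.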

Related

Replace values within matlab matrix using column values from another matrix

I have a big matrix (8656x25960) with some speckle noise within it. I used the findpeaks tool to find which columns have peaks above a certain threshold. The output of findpeaks is a vector containing all of the bad columns, for example:
loc =
Columns 1 through 6
30 51 155 307 333 338
Columns 7 through 12
642 955 1409 1567 1728 1730
Columns 13 through 18
2332 2546 2615 2685 2806 2995
Columns 19 through 24
3002 3122 3124 3164 3690 4176
Columns 25 through 30
4430 4475 4539 5142 5155 5244
Columns 31 through 36
5246 5941 5943 6114 6486 6922
Columns 37 through 42
7165 7169 7460 7587 7647 8944
Columns 43 through 44
12754 13693
How can I use those column numbers with the original matrix and replace the values of these 'bad' columns with the value 0 (for example)?
Hoping I'm clear enough.
For the row vector loc, simply use indexing:
yourmatrix(:,loc) = 0;
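A tiny example of the same indexing, with a made-up 4x6 matrix `M` and made-up column numbers standing in for the real data:

```matlab
M = reshape(1:24, 4, 6);   % hypothetical small matrix
loc = [2 5];               % hypothetical 'bad' column numbers
M(:, loc) = 0;             % scalar expansion zeroes every row of those columns
```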

Embedding an array into another

I have two arrays. The first one is a consecutive sequential one, like:
seq1 =
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
...continues
The second one is like:
seq2 =
2 250
3 260
5 267
6 270
8 280
10 290
13 300
18 310
20 320
21 330
...continues
I need to embed seq2 into seq1 in such a way that I end up with the sequence:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
... continues
I could do this with loops but the arrays are really big so I don't want to use two for loops, it is taking too long. How can I do this in a vectorised manner?
I think this does what you want:
[~, jj, vv] = find(sum(bsxfun(@le, seq2(:,1), seq1(:,1).'), 1));
seq3 = seq1;
seq3(jj,2) = seq2(vv,2);
How it works
The required index is obtained by computing how many values in the first column of seq2 are less than or equal to each value in the first column of seq1 (code sum(bsxfun(@le, ...), 1)). This index is used to select the appropriate entries from the second column of seq2 and write them into the result. But before that, the entries where this index is 0 need to be discarded. This is done using the three-output version of find (code [~, jj, vv] = find(...)).
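To see the index being built, here is the same code on a cut-down version of the sample data (first six rows of seq1, first three rows of seq2), with the intermediate count kept in a variable I have called `counts`:

```matlab
seq1 = [(1:6).' zeros(6, 1)];
seq2 = [2 250; 3 260; 5 267];

% For each key in seq1: how many keys in seq2 are <= it
counts = sum(bsxfun(@le, seq2(:,1), seq1(:,1).'), 1);   % [0 1 2 2 3 3]

% Drop the zeros; jj = rows of seq1 to fill, vv = rows of seq2 to copy from
[~, jj, vv] = find(counts);
seq3 = seq1;
seq3(jj, 2) = seq2(vv, 2);
```

Row 1 of seq1 has a count of 0, so it is skipped and keeps its original 0.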
If your second column of data is always increasing, you can solve this easily with accumarray and cummax:
seq = [seq1; seq2];
seq3 = cummax(accumarray(seq(:, 1), seq(:, 2), [], @max));
seq3 = [(1:numel(seq3)).' seq3];
And here's what you get for your sample inputs:
seq3 =
1 0
2 250
3 260
4 260
5 267
6 270
7 270
8 280
9 280
10 290
11 290
12 290
13 300
14 300
15 300
16 300
17 300
18 310
19 310
20 320
21 330
How it works...
After concatenating seq1 and seq2, accumarray collects all the values in the second column that have the same value in the first column (e.g. [0 250] for the value 2), then gets the maximum value of each set. The function cummax is then used to fill any zero values with the previous non-zero value. Finally, an index column is added to the new sequence.
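Here are the two stages separately on a cut-down version of the sample data, so the effect of each call is visible. The intermediate names `perKey` and `filled` are mine:

```matlab
seq1 = [(1:6).' zeros(6, 1)];
seq2 = [2 250; 3 260; 5 267];
seq = [seq1; seq2];

% Largest value seen for each key; keys missing from seq2 keep the 0 from seq1
perKey = accumarray(seq(:, 1), seq(:, 2), [], @max);   % [0 250 260 0 267 0].'

% Running maximum carries the last non-zero value forward
filled = cummax(perKey);                               % [0 250 260 260 267 267].'
seq3 = [(1:numel(filled)).' filled];
```

This is also why the second column has to be increasing: cummax would otherwise carry an earlier, larger value over a later, smaller one.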

Excluding NaNs from data sorting for box plot with overlaid data

I'm plotting a box plot with overlaid data from the following concatenated matrix:
data = [10 16 24 31 12 26 23 33;11 15 27 27 12 24 22 36;12 15 24 25 14 25 22 37;10 16 27 24 14 27 23 41;12 15 NaN NaN 15 NaN 22 NaN;13 18 NaN NaN 16 NaN 22 NaN]
The code for this plot is:
datas=sort(data);
datainbox=datas(ceil(end/4)+1:floor(end*3/4),:);
[n1,n2]=size(datainbox);
dataoutbox=datas([1:ceil(end/4) floor(end*3/4)+1:end],:);
n3=size(dataoutbox,1);
% calculate quartiles
dataq=quantile(data,[.25 .5 .75]);
% calculate range between box and outliers = between 1.5*IQR from quartiles
dataiqr=iqr(data);
datar=[dataq(1,:)-dataiqr*1.5;dataq(3,:)+dataiqr*1.5];
dataoutbox(dataoutbox<ones(n3,1)*datar(1,:)|dataoutbox>ones(n3,1)*datar(2,:))=nan;
figure()
hold on
bp = boxplot(data);
plot(ones(n1,1)*[1 2 3 4 5 6 7 8]+.4*(rand(n1,n2)-.5),datainbox,'k.','MarkerSize',12)
plot(ones(n3,1)*[1 2 3 4 5 6 7 8]+.4*(rand(n3,n2)-.5),dataoutbox,'.','color',[1 1 1]*.5,'MarkerSize',12)
set(bp,'linewidth',1);
As indicated above, I am sorting the data into 'datainbox' and 'dataoutbox' based on the IQR. The code works as expected (credit to JJM Driesson) except for the data columns containing NaNs, where as shown in the plot the data is not sorted correctly. How should I modify the above code to exclude NaNs from calculations and prevent this from influencing the plot?
Thank you for your time,
Laura
You should process every column separately. You can select the non-NaN values of column i as follows: col = data(~isnan(data(:, i)), i);
If you want all the boxplots in the same figure, you can try to use this answer.
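To spell that out: sort places NaN at the end of each column, so the end-based row ranges in the question pick up NaNs in the shorter columns. Here is a minimal sketch of the per-column version, reusing the question's in-box index formula on each NaN-free column. The cell arrays `inBox` and `outBox` are names I made up, and plotting is left out:

```matlab
data = [10 16 24 31 12 26 23 33; 11 15 27 27 12 24 22 36; ...
        12 15 24 25 14 25 22 37; 10 16 27 24 14 27 23 41; ...
        12 15 NaN NaN 15 NaN 22 NaN; 13 18 NaN NaN 16 NaN 22 NaN];

nCols = size(data, 2);
inBox = cell(1, nCols);      % middle part of each sorted column
outBox = cell(1, nCols);     % lower and upper parts of each sorted column
for i = 1:nCols
    col = sort(data(~isnan(data(:, i)), i));   % drop NaNs, then sort
    n = numel(col);
    isIn = false(n, 1);
    isIn(ceil(n/4)+1 : floor(n*3/4)) = true;   % same rule as the question
    inBox{i} = col(isIn);
    outBox{i} = col(~isIn);
end
```

Each cell can then be plotted against its column number i, with jitter added as in the original code.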

Matlab replace the nan with average of previous and next non-missing value

all,
I have a large dataset with a lot of consecutive NaNs. Is there a fast way to replace the NaNs with the average of the previous and next non-missing values, by column?
Thanks a lot
Lou
Interesting question... if only you explained clearly what you want. Maybe it's this?
data = [1 3 NaN 7 6 NaN NaN 2].'; %'// example data: column vector
isn = isnan(data); %// determine which values are NaN
inum = find(~isn); %// indices of numbers
inan = find(isn); %// indices of NaNs
comp = bsxfun(@lt,inan.',inum); %'// for each (number,NaN): 1 if NaN precedes num
[~, upper] = max(comp); %// next number to each NaN (max finds *first* maximum)
data(isn) = (data(inum(upper))+data(inum(upper-1)))/2; %// fill with average
In this example: original data:
>> data.'
ans =
1 3 NaN 7 6 NaN NaN 2
Result:
>> data.'
ans =
1 3 5 7 6 4 4 2
If you have a 2D array and want to work by columns, a for loop over columns is probably the best option.
And of course, if there can be NaN's at the beginning or end of a column, the problem is undefined.
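A sketch of that loop over columns, wrapping the vectorised step above. The matrix `M` is made-up example data, and like the original it assumes no column starts or ends with NaN:

```matlab
M = [1 10; NaN 20; 5 NaN; NaN NaN; NaN 60; 9 70];   % hypothetical data

for c = 1:size(M, 2)
    col = M(:, c);
    isn = isnan(col);                  % which values are NaN
    if ~any(isn), continue; end        % nothing to fill in this column
    inum = find(~isn);                 % indices of numbers
    inan = find(isn);                  % indices of NaNs
    comp = bsxfun(@lt, inan.', inum);  % (number, NaN): 1 if NaN precedes number
    [~, upper] = max(comp, [], 1);     % first number after each NaN
    col(isn) = (col(inum(upper)) + col(inum(upper - 1))) / 2;
    M(:, c) = col;
end
```

The explicit dimension argument in max keeps the search column-wise even when comp happens to have a single row.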
Assuming NaNs are not in the first/last row in any column, here is how I would do it:
(If there are multiple consecutive NaNs, it searches for the previous and next non-missing values and averages them).
% Creating A
A=magic(7);
newA=A; %Result will be in newA
A(3,4)=NaN;
A(2,1)=NaN;
A(5,6)=NaN;
A(6,6)=NaN;
A(4,6)=NaN;
% Finding NaN position and calculating positions where we have to average numbers
ind=find(isnan(A));
otherInd=setdiff(1:numel(A(:)),ind);
for i=1:size(ind,1)
temp=otherInd(otherInd<ind(i));
prevInd(i,1)=temp(end);
temp=otherInd(otherInd>ind(i));
nextInd(i,1)=temp(1);
end
% For faster processing purposes
allInd(1:2:2*length(prevInd))=prevInd;
allInd(2:2:2*length(prevInd))=nextInd;
fun=@(block_struct) mean(block_struct.data)
prevNextNums=A(allInd);
A
newA(ind)=blockproc(prevNextNums,[1 2],fun)
%-----------------------Answer--------------------------
A =
30 39 48 1 10 19 28
NaN 47 7 9 18 27 29
46 6 8 NaN 26 35 37
5 14 16 25 34 NaN 45
13 15 24 33 42 NaN 4
21 23 32 41 43 NaN 12
22 31 40 49 2 11 20
newA =
30 39 48 1 10 19 28
38 47 7 9 18 27 29
46 6 8 17 26 35 37
5 14 16 25 34 23 45
13 15 24 33 42 23 4
21 23 32 41 43 23 12
22 31 40 49 2 11 20
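Since each NaN only needs the mean of two numbers, the blockproc step (which needs the Image Processing Toolbox) can be replaced by plain indexing. This is a minimal sketch under the same assumption that no NaN sits in the first or last row. The `good` vector and the arrayfun lookups are my own variation, not the code from the answer above:

```matlab
A = magic(7);
A(3,4) = NaN; A(2,1) = NaN; A(4:6,6) = NaN;

ind = find(isnan(A));          % linear indices of the NaNs
good = find(~isnan(A));        % linear indices of the numbers

% Nearest non-missing linear index before and after each NaN
prevInd = arrayfun(@(k) good(find(good < k, 1, 'last')), ind);
nextInd = arrayfun(@(k) good(find(good > k, 1, 'first')), ind);

newA = A;
newA(ind) = (A(prevInd) + A(nextInd)) / 2;
```

Linear indices run down the columns, so the previous and next neighbours stay within the same column as long as the first and last rows contain no NaN.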

How to extract new matrix from existing one

I have a large number of entries arranged in three columns. Sample of the data is:
A=[1 3 2 3 5 4 1 5 ;
22 25 27 20 22 21 23 27;
17 15 15 17 12 19 11 18]'
I want to group the rows by the first column (hours) to create a new matrix as follows:
Anew=[1 2 3 4 5 ; 22.5 27 22.5 21 24.5; 14 15 16 19 15]'
Where the 2nd column of Anew is the average value for each corresponding hour. For example, from matrix A:
at hour 1, we have two values in the 2nd column, 22 and 23, so the average is 22.5.
Likewise for the 3rd column: at hour 1 we have 17 and 11, and the average is 14.
This continues up to hour 5. I am using MATLAB.
You can use ACCUMARRAY for this:
Anew = [unique(A(:,1)),...
cell2mat(accumarray(A(:,1),(1:size(A,1))',[],@(x){mean(A(x,2:3),1)}))]
This uses the first column A(:,1) as indices (x) to pick the values in columns 2 and 3 for averaging (mean(A(x,2:3),1)). The curly brackets and the call to cell2mat allow you to work on both columns at once. Otherwise, you could do each column individually, like this
Anew = [unique(A(:,1)), ...
accumarray(A(:,1),A(:,2),[],@mean), ...
accumarray(A(:,1),A(:,3),[],@mean)]
which may actually be a bit more readable.
EDIT
The above assumes that there's no missing entry for any of the hours. It will result in an error otherwise. Thus, a more robust way to calculate Anew is to allow for missing values. For easy identification of the missing values, we use the fillval input argument to accumarray and set it to NaN.
Anew = [(1:max(A(:,1)))', ...
accumarray(A(:,1),A(:,2),[],@mean,NaN), ...
accumarray(A(:,1),A(:,3),[],@mean,NaN)]
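To see the fill value at work, here is the same call on a variant of the sample data where I have removed the hour-4 entry (that modification is mine, purely for illustration):

```matlab
% Sample data from the question with the hour-4 row left out
A = [ 1  3  2  3  5  1  5;
     22 25 27 20 22 23 27;
     17 15 15 17 12 11 18]';

% Hours with no data get NaN instead of the default 0
Anew = [(1:max(A(:,1)))', ...
        accumarray(A(:,1), A(:,2), [], @mean, NaN), ...
        accumarray(A(:,1), A(:,3), [], @mean, NaN)];
```

Row 4 of Anew comes out as [4 NaN NaN], which is easy to spot or to filter out with isnan afterwards.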
You can use consolidator to do the work for you.
[Afinal(:,1),Afinal(:,2:3)] = consolidator(A(:,1),A(:,2:3),@mean);
Afinal
Afinal =
1 22.5 14
2 27 15
3 22.5 16
4 21 19
5 24.5 15