I am trying to create Pareto graphs based on a dataset from Excel. The dataset has three columns: "Comment", "part", and "number". The values in comment and number repeat because they are general, while the part is independent. As such, I need to group them based on the part.
I've been able to create two Pareto graphs. By getting the unique part numbers and counting the number of occurrences of unique comments, I have been able to plot the number of comments (y-axis) against the part (x-axis). The part I've been struggling with is plotting the number of comments (y-axis) by the number (x-axis) for a specified part.
Data = readtable('Example_Dataset.xlsx')
Data = Data{:,:}
part = Data(:,2) %Gets part
number = Data(:,3) %Gets number
comments = Data(:,1) %gets comment
Unique_Part= unique(part,'stable')
b = cellfun(@(x) sum(ismember(part,x)),Unique_Part,'un',0)
Unique_number = unique(number,'stable')
c = cellfun(@(x) sum(ismember(number,x)),Unique_number,'un',0)
Unique_comments = unique(comments,'stable')
comment_type =cell2mat(Unique_comments)
comments_parts = cell2mat(b)
comments_number = cell2mat(c)
figure
pareto(comments_parts,Unique_Part)
figure
pareto(comments_number,Unique_number)
A simplified dataset is shown here. Note that the groups are not of equal size: some values appear only once, others repeat numerous times. Also, the part is sometimes not numeric.
https://imgur.com/a/V3MxeTD
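One way to get the counts for a single part is to filter the rows for that part first and then count how often each number occurs among them. The sketch below is only an illustration: the values and the chosen part 'A1' are made up, and it assumes the columns are available as cell arrays of text (with readtable you would take them from the table instead).

```matlab
% Made-up stand-in for the spreadsheet columns (hypothetical values)
comment = {'scratch'; 'dent'; 'scratch'; 'crack'; 'dent'; 'scratch'};
part    = {'A1'; 'A1'; 'B2'; 'A1'; 'A1'; 'B2'};
number  = {'10'; '20'; '10'; '10'; '10'; '30'};

chosenPart = 'A1';                      % part to inspect (assumption)
rows = strcmp(part, chosenPart);        % logical mask selecting that part

% Count how often each number occurs among the selected rows
[numbersForPart, ~, grp] = unique(number(rows), 'stable');
countsForPart = accumarray(grp, 1);

figure
pareto(countsForPart, numbersForPart)
```

Because strcmp works on text, this also covers the case where the part is not numeric.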
Problem
I have a data set describing geological structures. Each structure has a row with two attributes: its length and its orientation (0-360 degrees).
Within this data set, there are two types of structure.
Type 1: fewer data points, but the structures are physically larger (large length, and so more significant).
Type 2: more data points, but the structures are physically smaller (small length, and so less significant).
I want to create a rose plot to show the spread of the structures' orientations. However, I want this plot to also represent the significance of the structures in combination with the direction they face - taking into account the lengths.
Is it possible to scale the plot by length in MATLAB so that the less numerous subset is not underrepresented when its structures are large?
Example
A data set might contain:
10 structures orientated North-South, 50km long.
100 structures orientated East-West, 0.5km long.
In this situation the East-West population would look more significant than the North-South population based on absolute numbers. In reality, however, the members contributing to this population are much shorter, and so the structures are less significant.
Code
This is the code I have so far:
load('WG_rose_data.xy')
azimuth = WG_rose_data(:,2);
length = WG_rose_data(:,1);
rose(azimuth,20);
Where WG_rose_data.xy is a data file with 2 columns containing the length and azimuth (orientation) data for the geological structures.
For each row in your data, you could duplicate it a given number of times, according to its length value. Therefore, if you had a structure with length 50, it counts for 50 data points, whereas a structure with length 1 only counts as 1 data point. Of course you have to round your lengths since you can only have integer numbers of rows.
This could be achieved like so, with your example data in the matrix d:
% Set up example data: 10 large vertical structures, 100 small ones perpendicular
d = [repmat([0, 50], 10, 1); repmat([90, .5], 100, 1)];
% For each row, duplicate the data in column 1, according to the length in column 2
d1 = [];
for ii = 1:size(d,1)
% make d(ii,2) = length copies of d(ii,1) = orientation
d1(end+1:end+ceil(d(ii,2))) = d(ii,1);
end
Output rose plot:
You could fine tune how to duplicate the data to achieve the desired balance of actual data and length weighting.
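As an alternative to duplicating rows, the lengths can be summed per orientation bin and the totals plotted directly, which avoids rounding the lengths. This is a minimal sketch, not part of the answer above: the 15-degree bin width and the variable names are assumptions, and polarhistogram with precomputed bin counts needs R2016b or later.

```matlab
% Example data from the question: [azimuth in degrees, length in km]
d = [repmat([0, 50], 10, 1); repmat([90, 0.5], 100, 1)];

binWidth = 15;                       % degrees per bin (assumption)
nBins    = 360 / binWidth;
edges    = 0:binWidth:360;

% Bin index of each structure, then total length falling into each bin
binIdx    = floor(mod(d(:,1), 360) / binWidth) + 1;
binWeight = accumarray(binIdx, d(:,2), [nBins, 1]);

% Plot the summed lengths in place of raw counts
figure
polarhistogram('BinEdges', deg2rad(edges), 'BinCounts', binWeight.')
```

With the example data the North-South bin carries a total of 500 km and the East-West bin 50 km, so the larger structures dominate the plot.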
Thanks for all the help with this. This code is my final working version for reference:
clear all
close all
% Input dataset
original_data = load('WG_rose_data.xy');
d = [];
%reformat azimuth
d(:,1)= original_data(:,2);
%reformat length
d(:,2)= original_data(:,1);
% For each row, duplicate the data in column 1, according to the length in column 2
d1 = [];
for a = 1:size(d,1)
d1(end+1:end+ceil(d(a,2))) = d(a,1);
end
%create opposite directions for rose diagram
length_d1_azi = length(d1);
d1_op_azi=zeros(1,length_d1_azi);
for i = 1:length_d1_azi
d1_op_azi(i)=d1(i)-180;
if d1_op_azi(i) < 1;
d1_op_azi(i) = 360 - (d1_op_azi(i)*-1);
end
end
%join calculated opposites to original input
new_length = length_d1_azi*2;
all_azi=zeros(new_length,1); %named all_azi to avoid shadowing the built-in all
for i = 1:length_d1_azi
all_azi(i)=d1(i);
end
for j = length_d1_azi+1:new_length
all_azi(j)=d1_op_azi(j-length_d1_azi);
end
%convert input array into radians to plot
d1_rad=deg2rad(all_azi);
rose(d1_rad,24)
set(gca,'View',[-90 90],'YDir','reverse');
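The two loops that build and append the opposite directions can also be written without loops. The following is a minimal sketch with made-up azimuths; all_azi is a hypothetical variable name. It differs from the loop version only in that an azimuth of 180 maps to 0 instead of 360, which is the same direction.

```matlab
d1 = [0 90 270 180];               % made-up azimuths in degrees

d1_op   = mod(d1 - 180, 360);      % opposite direction, wrapped into [0, 360)
all_azi = [d1(:); d1_op(:)];       % originals followed by their opposites
d1_rad  = deg2rad(all_azi);        % rose expects angles in radians
```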
So, I'm trying to simulate an arbitrary model of a bag of marbles (with replacement, if that makes a difference in how this works) and am running into issues displaying the results.
How I have it set up is the code asks for how many marbles are in the bag, then how many you would like to pick, and then how many different colors there are (defined as N, S, and k respectively).
I then go through a loop between 1 and k in a cell array to name the colors of the marbles and then create a second array that simulates the probabilities by asking how many of each color there is in the bag.
I then generate a random matrix that simulates 10 "games" (i.e. rDist=randi(N,[10,S]);).
Now that I have the marbles that I've picked, I create another 10xS cell array and want to fill that cell array with the colors of the marbles based on the number picked. That is, let's say I have 10 marbles and 7 are red and 3 are green. If the PRNG picks 1:7, I want the results cell array to say "red" and if it chooses 8:10, I want "green" in the corresponding positions. I can do this for finite numbers, but I want to extend this to K marble colors with any number of distributions of marble colors. Can you offer any help?
My "finite" solution for 2 marble types is below:
for lc=1:10*S
counter=0;
if (rDist(lc)>=1 && (rDist(lc)<=Probabilities(1)))
Results{lc}=Color{1};
end
counter=Probabilities(1);
if (rDist(lc)>counter && (rDist(lc)<=counter+Probabilities(2)))
Results{lc}=Color{2};
end
end
You can calculate the intervals that correspond to each color with cumsum. Then you need to find which interval each entry of rDist belongs to.
numPicks = 5;
numGames = 10;
names = {'red', 'white', 'blue'};
counts = [2 6 9];
N = sum(counts);
cumsumCounts = cumsum(counts);
rDist=randi(N, [numGames, numPicks]);
out = cell(size(rDist));
for i = length(counts):-1:1
out(rDist <= cumsumCounts(i)) = names(i);
end
You could also do this with quantiz from the Communications Toolbox or randsample from the Statistics and Machine Learning Toolbox. Finally, you could use the more confusing one-liner out = names(arrayfun( @(x)( find(cumsumCounts >= x, 1) ), rDist));
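The interval lookup can also be done in one vectorized step by counting how many cumulative bounds each draw exceeds. This is a sketch under assumptions: the draws are fixed rather than random so the result is easy to check by hand, and the comparison relies on implicit expansion (R2016b or later).

```matlab
names  = {'red', 'white', 'blue'};
counts = [2 6 9];
cumsumCounts = cumsum(counts);          % upper bound of each color's interval: [2 8 17]

rDist = [1 2 3; 8 9 17];                % fixed draws instead of randi, for illustration

% Each draw exceeds the bounds of all colors before its own, so the count + 1 is its color index
idx = sum(rDist(:) > cumsumCounts, 2) + 1;
out = reshape(names(idx), size(rDist));
```

Here the draws 1 and 2 map to 'red', 3 and 8 to 'white', and 9 and 17 to 'blue'.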
I want to find the r^2 for each slice along the 3rd dimension (each slice is basically a column of data). However, when I index into each of the cells with a for loop (looping through the states and then through the sets of data), I run into index-out-of-bounds errors, since some of the third dimensions are small while others are larger.
I tried to sort the cells first:
[dummy, Index] = sort(cellfun('size', data_O3_spring, 3), 'descend');
S = data_O3_spring(Index);
And then loop through and find the corrcoef (using the data set data_O3_spring, which is in the same form as described above):
for k = 1:7 % Number of states
for j = 1:17 % largest number of sites
r2_spring{k}(:,:,j) = power((corrcoef(S{k}(:,:,j), data_PM25_spring{k}(:,:,j), 'rows', 'pairwise')), 2);
end
end
However, this gives me an index-out-of-bounds error when I go above 5 (the size of the smallest set of data).
About the format of my data: data_O3_spring is a <1x7> cell containing data for 7 states for the months in spring.
data_O3_spring{1} (one of the states) has 7 cells (different sets of data I'm looking at), each of which is size:
<61x1x7 double>
<61x1x17 double>
<61x1x8 double>
<61x1x16 double>
<61x1x5 double>
<61x1x12 double>
<61x1x13 double>
61 is the number of days (rows). There's 1 column. And the third dimension size is the number of sets of data I'm looking at in that particular state (so it varies by state).
I tried using a while loop, but didn't manage to get it to work either.
I may be missing a detail, but it seems you can change your loop from:
for j=1:17,
to
for j = 1:size(S{k},3),
Each state has a different number of sites, and that's fine because you are storing the output in a cell array (r2_spring{k}(:,:,j)), which does not require that the dimension indexed by j be equal.
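Also, pairing S{k}(:,:,j) with data_PM25_spring{k}(:,:,j) is a problem since S is a reordered copy of data_O3_spring, so the k-th cell of S no longer matches the k-th cell of data_PM25_spring. I'd say to try either applying the same reordering to both (P = data_PM25_spring(Index);) and using:
corrcoef(S{k}(:,:,j), P{k}(:,:,j), 'rows', 'pairwise')
or skipping the sort and using:
corrcoef(data_O3_spring{k}(:,:,j), data_PM25_spring{k}(:,:,j), 'rows', 'pairwise')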
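To illustrate the loop-bound change, here is a minimal sketch with made-up data. The names O3, PM25 and r2, the two states and their site counts are all hypothetical; only the 61 days per site follow the question. It stores just the off-diagonal entry of the 2x2 correlation matrix rather than the whole matrix, which is a design choice and not what the original loop does.

```matlab
% Made-up data: 2 states with 3 and 5 sites, 61 days each (hypothetical)
nSites = [3 5];
O3   = cell(1, 2);
PM25 = cell(1, 2);
for k = 1:2
    O3{k}   = rand(61, 1, nSites(k));
    PM25{k} = rand(61, 1, nSites(k));
end

r2 = cell(1, 2);
for k = 1:numel(O3)
    for j = 1:size(O3{k}, 3)      % only as many sites as this state actually has
        R = corrcoef(O3{k}(:,:,j), PM25{k}(:,:,j), 'rows', 'pairwise');
        r2{k}(j) = R(1,2)^2;      % off-diagonal entry is the O3 / PM2.5 correlation
    end
end
```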
I have a matrix, X, that I want to plot after clustering it with the kmeans function. What I would like: if a row has a value of 1 in column 4, it should be drawn as a square; if it has a value of 2 in column 4, it should be drawn as a +. In addition, if the row has a value of 0 in column 5 it must be blue, and if it has a value of 1 in column 5 it must be yellow.
(You don't need to use these exact colors and shapes, I just want to distinguish these.) I tried this and it did not work:
plot(X(idx==2,1),X(idx==2,2),X(:,4)==1,'k.');
Thanks!!
Based on the example on the kmeans documentation page I propose this "nested" logic:
X = [randn(100,2)+ones(100,2);...
randn(100,2)-ones(100,2)];
opts = statset('Display','final');
% This gives a random distribution of 0s and 1s in column 5:
X(:,5) = round(rand(size(X,1),1));
[idx,ctrs] = kmeans(X,2,...
'Distance','city',...
'Replicates',5,...
'Options',opts);
hold on
plot(X(idx==1,1),X(idx==1,2),'rs','MarkerSize',12)
plot(X(idx==2,1),X(idx==2,2),'r+','MarkerSize',12)
% after plotting the results of kmeans,
% plot new symbols with a different logic on top:
plot(X(idx==1 & X(:,5)==0,1),X(idx==1 & X(:,5)==0,2),'bs','MarkerSize',12)
plot(X(idx==1 & X(:,5)==1,1),X(idx==1 & X(:,5)==1,2),'gs','MarkerSize',12)
plot(X(idx==2 & X(:,5)==0,1),X(idx==2 & X(:,5)==0,2),'b+','MarkerSize',12)
plot(X(idx==2 & X(:,5)==1,1),X(idx==2 & X(:,5)==1,2),'g+','MarkerSize',12)
The above code is a minimal working example, given that the Statistics Toolbox is available.
The key feature is the combined logical mask used for the plotting. For example:
X(idx==1 & X(:,5)==0,1)
The mask idx==1 & X(:,5)==0 is true only for rows that belong to cluster 1 and have a 0 in column 5, so the expression returns the first coordinate of exactly those rows. The mask must have one entry per row of X: an expression such as X(idx==1,5)==0 is only as long as cluster 1, and indexing X with it would pick from the first rows of X instead of the cluster's rows. Based on the logic in the question, this should be a blue square: bs.
You have four cases, hence there are four additional plot lines.
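The four plot calls can also be generated in a loop over the shape and color flags. This is a minimal sketch that follows the question's layout (shape flag in column 4 with values 1 or 2, color flag in column 5 with values 0 or 1). The data is made up, green stands in for yellow, and nPlotted is a hypothetical counter added only to check that every row is drawn once.

```matlab
% Made-up data: columns 1-2 are coordinates, column 4 is the shape flag (1 or 2),
% column 5 is the color flag (0 or 1)
X = [randn(40, 2), zeros(40, 3)];
X(:,4) = repmat([1; 2], 20, 1);
X(:,5) = repmat([0; 0; 1; 1], 10, 1);

shapes = {'s', '+'};      % marker for column 4 == 1 and == 2
colors = {'b', 'g'};      % color for column 5 == 0 and == 1

figure
hold on
nPlotted = 0;
for s = 1:2
    for c = 0:1
        mask = X(:,4) == s & X(:,5) == c;     % rows with this shape AND this color
        plot(X(mask,1), X(mask,2), [colors{c+1} shapes{s}], 'MarkerSize', 12)
        nPlotted = nPlotted + nnz(mask);
    end
end
hold off
```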