I'm trying to find a fit to my data using MATLAB, but I'm having a lot of trouble. Here's what I've done so far:
A = load('homicide_crime.txt'); % A is a two-column array: the first column is the year, the second is the crime count for that year
norm_crime = (A(:,2)-mean(A(:,2)))/std(A(:,2));
[f,x]=hist(norm_crime,20);
plot(x,f/trapz(x,f))
y=normpdf(x,0,1);
hold on
plot(x,y)
This is the resulting plot
So afterwards I tried using the Distribution Fitter app, which gave me this.
Neither of these looks right, since the peaks aren't aligned and the fit is too small.
Here is the data set for anyone interested:
https://pastebin.com/CyddrN1R
Any help is much appreciated.
Actually, I think you are confusing data transformation with distribution fitting.
DATA TRANSFORMATION
In this approach, the data is manipulated through a non-linear transformation in order to achieve a perfect fit. This means that it forces your data to follow the chosen distribution. To accomplish this with a normal distribution, all you have to do is apply the following code:
A = load('homicide_crime.txt');
years = A(:,1);
crimes = A(:,2);
figure(),histfit(crimes);
rank = tiedrank(crimes);
p = rank ./ (numel(rank) + 1);
crimes_normal = norminv(p,0,1);
figure(),histfit(crimes_normal);
By contrast, using the following manipulation:
crimes_normal = (crimes - mean(crimes)) ./ std(crimes);
that can also be written as:
crimes_normal = zscore(crimes);
you modify your observations so that they have mu=0 and sigma=1, but this is far from making them perfectly fit a normal distribution.
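If you want to check that claim yourself, here is a quick sketch with a normality test (this assumes the Statistics Toolbox; kstest compares against a standard normal by default, so it can be applied directly to both transformed versions):
% h = 1 means normality is rejected at the 5% significance level
h_zscore = kstest(zscore(crimes))   % expected 1: z-scoring does not make the data normal
h_rank   = kstest(crimes_normal)    % expected 0: the rank-based transform is normal by construction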
DISTRIBUTION FITTING
In this approach, the parameters of the chosen distribution are calculated over the given dataset, and then random observations are drawn. On one side you have your empirical observations, and on the other side you have your fitted data. A goodness-of-fit test can finally tell you how well empirical observations fit the given distribution comparing them to theoretical observations.
Since you are working with a normal distribution, you know that it is fully described by two parameters: mu and sigma. Hence:
A = load('homicide_crime.txt');
years = A(:,1);
crimes_emp = A(:,2);
[mu,sigma] = normfit(crimes_emp);
% you can also use
% mu = mean(crimes_emp);
% sigma = std(crimes_emp);
% to achieve the same result
[f,x] = hist(crimes_emp);
crimes_the = normpdf(x,mu,sigma) .* max(f);
figure();
bar(x, (f ./ sum(f)));
hold on;
plot(x,crimes_the,'-r','LineWidth',2);
hold off;
And this returns something very close to the problem you originally noticed. As you can clearly see, without even running a Kolmogorov-Smirnov or an Anderson-Darling test, your data doesn't fit a normal distribution well.
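If you do want to run those tests, something along these lines should work (a sketch, assuming the Statistics Toolbox; makedist and adtest are available in R2013a and later):
pd = makedist('Normal', 'mu', mu, 'sigma', sigma);  % the fitted normal distribution
[h_ks, p_ks] = kstest(crimes_emp, 'CDF', pd);       % Kolmogorov-Smirnov test against the fit
[h_ad, p_ad] = adtest(crimes_emp);                  % Anderson-Darling (fits a normal internally)
% h = 1 means the normality hypothesis is rejected at the 5% level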
You can try a non-parametric density estimation method. I used kernel density estimation (KDE) with the default normal kernel to obtain the result shown below. The MATLAB command for this is ksdensity(), and its documentation can be found here.
A = load('homicide_crime.txt'); % load data
years = A(:,1);
values = A(:,2);
[f0,x0] = hist(values,100); % plot the actual histogram
[f1,x1,b1] = ksdensity(values); % KDE with automatically assigned bandwidth
[f2,x2,b2] = ksdensity(values,'Bandwidth',b1*0.6); % 60% of initial bandwidth (b1)
[f3,x3,b3] = ksdensity(values,'Bandwidth',b2*0.6); % 60% of previous bandwidth (b2) = 36% of initial bandwidth (b1)
[f4,x4,b4] = ksdensity(values,'Bandwidth',b3*0.6); % 60% of previous bandwidth (b3) = 21.6% of initial bandwidth (b1)
figure; hold on;
bar(x0, f0/(sum(f0)*10) ); % scaled for visualization
plot(x1, f1, 'y')
plot(x2, f2, 'c')
plot(x3, f3, 'g')
plot(x4, f4, 'r','linewidth',3) % final fit
In the code above, I first plot the histogram and then compute the KDE without any user-specified bandwidth. This leads to an over-smoothed fit (yellow curve). After a few trials, each time reducing the bandwidth to 60% of the previous value, I was able to get the closest fit (red curve). You can play around with the bandwidth to get an even better fit.
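If you prefer not to chain the 60% reductions by hand, a small loop over bandwidth scale factors (relative to the automatic bandwidth b1 from the code above) gives the same kind of comparison; the scale factors here are just illustrative:
scales = [1 0.6 0.36 0.216];      % 100%, 60%, 36%, 21.6% of the automatic bandwidth
figure; hold on;
bar(x0, f0/(sum(f0)*10));         % scaled histogram, as before
for s = scales
    [fs, xs] = ksdensity(values, 'Bandwidth', b1*s);
    plot(xs, fs)
end
hold off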
I am currently working on a thesis that needs ultrasonic pulse velocity (UPV). UPV can easily be obtained with the right machines, but the data we acquired didn't include UPV, so we are tasked with computing it manually.
Essentially in the data we have 2 channels, one for the transmitter transducer and another for a receiver transducer.
We need to get the time at which the wave is emitted from the transmitter and the time at which it arrives at the receiver.
Using MATLAB, I've tried finddelay and xcorr, but they don't quite give the right result.
Here is a picture of the points I want to get. The plot shows the transmitter (blue) and receiver (red).
So I am trying to find the two points in the picture, but with the aid of MATLAB. The two points would give the travel time and, from that, the UPV.
I am a relatively new MATLAB user, so your help would be much appreciated.
Here is the code I have tried:
[cc, lags] = xcorr(signal1,signal2);
d2 = -(lags(cc == max(cc))) / Fs;
@xenoclast hi there! So far the code I used is this:
close all
clc
Fs = input('input Fs = ');
T = 1/Fs;
L = input('input L = ');
t = (0:L-1)*T;
time = transpose(t);
i = input('input number of steploads = ');
% construct test sequences
%dataupv is the signal1 & datathesis is the signal2
for m=1:i
y1 = (dataupv(:,m) - mean(dataupv(:,m))) / std(dataupv(:,m));
y2 = (datathesis(:,m) - mean(datathesis(:,m))) / std(datathesis(:,m));
offset = 166;
tt = time;
% correlate the two sequences
[cc, lags] = xcorr(y2, y1);
% find the index of the maximum
[maxval, maxI] = max(cc);
[minval, minI] = min(cc);
% use that index to obtain the lag in samples
lagsamples(m,1) = lags(maxI);
lagsamples2(m,1) = lags(minI);
% plot again without timebase to verify visually
end
The resulting value is off by 70 samples compared to when I visually inspect the waves: the lag came out as 244, but visually it should be 176. Here are the data (there are 19 sets of data, but I only used the 1st column): https://www.dropbox.com/s/ng5uq8f7oyap0tq/datatrans-dec-2014.xlsx?dl=0 https://www.dropbox.com/s/1x7en0x7elnbg42/datarec-dec-2014.xlsx?dl=0
Your example code doesn't specify Fs, so I don't know for sure, but I'm guessing it's an issue of sample rate(s). All the examples of cross-correlation start out by constructing test sequences according to a specific sample rate that they usually call Fs, not to be confused with the frequency of the test tone, which you may see called Fc.
If you construct the test signals in terms of Fs and Fc then this works fine but when you get real data from an instrument they often give you the data and the timebase as two vectors, so you have to work out the sample rate from the timebase. This may not be the same as the operating frequency or the components of the signal, so confusion is easy here.
But the sample rate is only required in the second part of the problem, where you work out the offset in time. First you have to get the offset in samples and that's a lot easier to verify.
Your example will give you the offset in samples if you remove the '/ Fs' term and you can verify it by plotting the two signals without a timebase and inspecting the sample positions.
I'm sure you've looked at dozens of examples, but here's one that tries not to confuse the issue by tying it to sample rates. You'll note that nowhere is the 'sample rate' specified; everything is expressed in samples (although if you treat the 5 in the y1 definition as a frequency in Hz, you will be able to infer one).
% construct test sequences
offset = 23;
tt = 0:0.01:1;
y1 = sin(2*pi*5*tt);
y2 = 0.8 * [zeros(1, offset) y1];
figure(1); clf; hold on
plot(tt, y1)
plot(tt, y2(1:numel(tt)), 'r')
% correlate the two sequences
[cc, lags] = xcorr(y2, y1);
figure(2); clf;
plot(cc)
% find the index of the maximum
[maxval, maxI] = max(cc);
% use that index to obtain the lag in samples
lagsamples = lags(maxI);
% plot again without timebase to verify visually
figure(3); clf; hold on
plot(y1)
plot(y2, 'r')
plot([offset offset], [-1 1], 'k:')
Once you've got the offset in samples you can probably deduce the required conversion factor, but if you have timebase data from the instrument then the inverse of the difference between any two consecutive entries will give it to you.
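For example, if tt is the timebase vector from the instrument and lagsamples is the lag found earlier, the conversion could look like this (Fs_est is just an illustrative name):
Fs_est = 1 / (tt(2) - tt(1));        % sample rate = inverse of the timebase spacing
delay_seconds = lagsamples / Fs_est; % offset in samples converted to time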
UPDATE
When you correlate the two signals you can visualise it as overlaying them and summing the product of corresponding elements. This gives you a single value. Then you move one signal by a sample and do it again. Continue until you have done it at every possible arrangement of the two signals.
The value obtained at each step is the correlation, but the 'lag' is computed starting with one signal all the way over to the left and the other overlapping by only one sample. You slide the second signal all the way over until it's only overlapping the other end by a sample. Hence the number of values returned by the correlation is related to the length of both the original signals, and relating any given point in the correlation output, such as the max value, to the arrangement of the two signals that produced it requires you to do a calculation involving those lengths. The xcorr function makes this easier by outputting the lags variable, which tracks the alignment of the two signals. People may also talk about this as an offset so watch out for that.
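To make the lag bookkeeping concrete, here is a tiny self-contained check (the sequences are made up):
a = [1 2 3];                 % reference sequence
b = [0 0 1 2 3];             % the same sequence delayed by 2 samples
[cc, lags] = xcorr(b, a);    % lags runs from -(maxlen-1) to maxlen-1, here -4:4
[~, k] = max(cc);
lags(k)                      % returns 2: b lags a by 2 samples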
Hi, I'm working on clustering network data from the 1999 DARPA data set. Unfortunately I'm not really getting clustered data, at least not compared to some of the literature that uses the same techniques and methods.
My data comes out like this:
As you can see, it is not very clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I have tried:
%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 3*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
In the first run, this removed 48 rows from the 1000 normalized random rows that were selected from the full dataset.
This is the full script I used on the data:
%% load data
%# read the list of features
fid = fopen('kddcup.names','rt');
C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
fclose(fid);
%# determine type of features
C{2} = regexprep(C{2}, '.$',''); %# remove "." at the end
attribNom = [ismember(C{2},'symbolic');true]; %# nominal features
%# build format string used to read/parse the actual data
frmt = cell(1,numel(C{1}));
frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
frmt( ismember(C{2},'symbolic') ) = {'%s'}; %# nominal features: read as string
frmt = [frmt{:}];
frmt = [frmt '%s']; %# add the class attribute
%# read dataset
fid = fopen('kddcup.data_10_percent_corrected','rt');
C = textscan(fid, frmt, 'Delimiter',',');
fclose(fid);
%# convert nominal attributes to numeric
ind = find(attribNom);
G = cell(numel(ind),1);
for i=1:numel(ind)
[C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
end
%# all numeric dataset
fulldata = cell2mat(C);
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick random columns
indY = randperm( size(U,2) );
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = ((data-minData)./(maxData));
% output matching data
dataSample = fulldata(indX, :)
%% When an outlier is considered to be more than three standard deviations away from the mean, use the following syntax to determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
This is two distinct clusters from the output:
As you can see the data looks cleaner and more clustered than the original. However I still think a better method can be used.
For instance, observing the overall clustering, I still have a lot of noise (outliers) in the dataset, as can be seen here:
I need the outlier rows put into a separate dataset for later classification (they should only be removed from the clustering), as sketched below.
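For what it's worth, a small tweak to the removal step above could keep those rows to one side instead of discarding them (a sketch; variable names follow the script above):
outlierRows = any(outliers, 2);      % logical index of rows containing an outlier
outlierData = data(outlierRows, :);  % keep the outlier rows for later classification
data(outlierRows, :) = [];           % remove them from the clustering data only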
Here is a link to the DARPA dataset. Please note that the 10% data set has had a significant reduction in columns: the majority of columns, which have 0s or 1s running throughout, have been removed (42 columns down to 6):
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
EDIT
Columns kept in the dataset are:
src_bytes: continuous.
dst_bytes: continuous.
count: continuous.
srv_count: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
RE-EDIT:
Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering using k-medoids (http://en.wikipedia.org/wiki/K-medoids). I'm hoping there isn't much of a change to the code I currently have, but as of yet I do not know how to implement it to test whether this will significantly reduce the noise. So, provided that someone can show me a working example, this will be accepted as an answer.
Note that using this dataset is discouraged:
That dataset has errors: KDD Cup '99 dataset (Network Intrusion) considered harmful
Consider using a different algorithm. k-means is not really appropriate for mixed-type data, where many attributes are discrete and have very different scales. k-means needs to be able to compute sensible means, and for a binary vector, 0.5 is not a sensible mean; it should be either 0 or 1.
Plus, k-means doesn't like outliers too much.
When plotting, make sure to scale the axes equally, or the result will look incorrect. Your x-axis has a length of around 0.9, your y-axis only 0.2; no wonder the clusters look squashed.
Overall, maybe the data set just doesn't have k-means-style clusters? You should definitely try a density-based method (because these can deal with outliers), such as DBSCAN; a minimal sketch follows below. But judging from the visualizations you added, I'd say it has at most 4-5 clusters, and they are not really interesting. They could probably be captured with a number of thresholds in some dimensions.
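If your MATLAB release is recent enough (R2019a or newer ships a dbscan function in the Statistics and Machine Learning Toolbox), a minimal sketch on the reduced data from the question could look like this; epsilon and minpts are placeholder values you would need to tune:
epsilon = 0.05;                        % neighbourhood radius - tune to your scaling
minpts  = 10;                          % minimum neighbours for a core point - tune as well
labels  = dbscan(data, epsilon, minpts);
% dbscan labels noise points as -1, which is exactly the outlier set you want to keep aside
noiseRows   = data(labels == -1, :);
clusterRows = data(labels ~= -1, :);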
Here is a visualization of the data set after z-normalization, in parallel coordinates, with 5000 samples. Bright green is normal.
You can clearly see special properties of the data set. All of the attacks clearly differ in attributes 3 and 4 (count and srv_count), and most are also very concentrated in dst_host_count and dst_host_srv_count.
I ran OPTICS on this data set, too. It found a number of clusters, most of them within the wine-colored attack pattern. But they're not really interesting. If you have 10 different hosts ping-flooding, they will form 10 clusters.
You can see very well that OPTICS managed to cluster a number of these attacks. It missed all the orange stuff (it might have caught it if I had set minpts lower; it is quite spread out), but it even discovered *structure* within the wine-colored attack, breaking it into a number of separate events.
To really make sense of this data set, you should start with feature extraction, for example by merging such ping flood connection attempts into an aggregate event.
Also note that this is an unrealistic scenario.
There are well-known patterns involved in attacks, in particular port scans. These are best detected with a specialized port scan detector, not with learning.
The simulated data contains a lot of completely pointless "attacks". For example, the Smurf attack from the 90s makes up >50% of the data set, and Syn flood is another 20%, while normal traffic is <20%!
For these kind of attacks, there are well-known signatures.
Many modern attacks (SQL injection, for example) flow within usual HTTP traffic and will not show up as anomalous in the raw traffic pattern.
Just don't use this data for classification or outlier detection. Just don't.
Quoting the KDNuggets link above:
As a result, we strongly recommend that
(1) all researchers stop using the KDD Cup '99 dataset,
(2) The KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and
(3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.
This is neither real nor realistic data. Go get something else.
First things first: you're asking for a lot here. For future reference: try to break up your problem into smaller chunks and post several questions. This increases your chances of getting answers (and doesn't cost you 400 reputation!).
Luckily for you, I understand your predicament, and just love this sort of question!
Apart from this dataset's possible issues with k-means, the question is still generic enough to apply to other datasets as well (and thus to Googlers who end up here looking for something similar), so let's go ahead and get this solved.
My suggestion is we edit this answer until you get reasonably satisfactory results.
Number of clusters
Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).
Visual inspection
It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you'll need, just by looking.
Pros:
quick
simple
works well on well-separated clusters in relatively small datasets
Cons:
quick and dirty
requires user interaction
it's easy to miss smaller clusters
data with less-well-separated clusters, or very many of them, is hard to handle with this method
it is all rather subjective -- the next person might select a different number than you did.
silhouettes plot
As indicated in one of your other questions, making a silhouettes plot will help you make a better decision about the proper number of clusters in your data (a short sketch follows the pros and cons below).
Pros:
relatively simple
reduces subjectivity by using statistical measures
intuitive way to represent quality of the choice
Cons:
requires user interaction
In the limit, if you take as many clusters as there are datapoints, a silhouettes plot will tell you that that is the best choice
it is still rather subjective, not based on statistical means
can be computationally expensive
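Here is a minimal sketch of that silhouette comparison (assuming the Statistics Toolbox and the data variable from your script; the range of K values is just illustrative):
Ks = 2:10;                                   % candidate numbers of clusters
avgSilh = zeros(size(Ks));
for j = 1:numel(Ks)
    cidx = kmeans(data, Ks(j), 'Replicates', 3, 'EmptyAction', 'singleton');
    avgSilh(j) = mean(silhouette(data, cidx));
end
figure, plot(Ks, avgSilh, '-o')
xlabel('number of clusters K'), ylabel('mean silhouette value')
% a higher mean silhouette suggests a better-separated clustering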
elbow method
As with the silhouettes plot approach, you run kmeans repeatedly, each time with a larger number of clusters, and you see how much of the total variance in the data is explained by the clusters chosen by this kmeans run. There will be a number of clusters where the amount of explained variance suddenly increases a lot less than in any of the previous choices of the number of clusters (the "elbow"). Statistically speaking, the elbow is then the best choice for the number of clusters (a minimal sketch follows the pros and cons below).
Pros:
no user interaction required -- the elbow can be selected automatically
statistically more sound than any of the aforementioned methods
Cons:
somewhat complicated
still subjective, since the definition of the "elbow" depends on subjectively chosen parameters
can be computationally expensive
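For concreteness, a minimal sketch of the elbow computation described above (kmeans' third output holds the within-cluster sums of point-to-centroid distances; data is the variable from your script, and the range of K values is illustrative):
Ks = 1:10;
wcss = zeros(size(Ks));                      % total within-cluster sum of distances
for j = 1:numel(Ks)
    [~, ~, sumd] = kmeans(data, Ks(j), 'Replicates', 3, 'EmptyAction', 'singleton');
    wcss(j) = sum(sumd);
end
figure, plot(Ks, wcss, '-o')
xlabel('number of clusters K'), ylabel('total within-cluster distance')
% the "elbow" is the K after which the curve flattens out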
Outliers
Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.
I would start with a two-step iterative approach, using the elbow method. In pseudo-Matlab:
data = your initial dataset
dataMod = your initial dataset
MAX = the number of clusters chosen by visual inspection
while (forever)
for N = MAX-5 : MAX+5
if (N < 1), continue, end
perform k-means with N clusters on dataMod
if (variance explained shows a jump)
break
end
if (you are satisfied)
break
end
for i = 1:N
extract all points from cluster i
find the centroid (let k-means do that)
calculate the standard deviation of distances to the centroid
mark points further than 3 sigma as possible outliers
end
dataMod = data with marked points removed
end
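One possible way to realize the inner marking loop above in real MATLAB (this is only my interpretation of the pseudocode, not tested on your data; it relies on the fourth kmeans output D, which holds each point's distance to every centroid):
[clustIDX, C, sumd, D] = kmeans(dataMod, N, 'Replicates', 3, 'EmptyAction', 'singleton');
isOutlier = false(size(dataMod,1), 1);
for i = 1:N
    inC   = (clustIDX == i);
    dists = sqrt(D(inC, i));       % D is squared distance for 'sqeuclidean'
    thr   = mean(dists) + 3*std(dists);
    isOutlier(inC) = dists > thr;  % mark points further than 3 sigma from the centroid
end
dataMod(isOutlier, :) = [];        % marked points removed before the next pass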
The tough part is obviously determining whether you are satisfied.
This is the key to the algorithm's effectiveness. The rough structure of this part
if (you are satisfied)
break
end
would be something like this
if (situation has improved)
data = dataMod
elseif (situation is same or worse)
dataMod = data
break
end
The situation has improved when there are fewer outliers, or when the variance explained for ALL choices of N is better than during the previous iteration of the while loop. This is also something to fiddle with.
Anyway, I wouldn't call this much more than a first attempt.
If anyone sees incompleteness, flaws or loopholes here, please comment or edit.
I have a Matlab figure I want to use in a paper. This figure contains multiple cdfplots.
Now the problem is that I cannot use markers, because they become very dense in the plot.
If I want to make the markers sparse, I have to drop some samples from the cdfplot, which will result in a different cdfplot line.
How can I add enough markers while maintaining the actual line?
One method is to get the XData/YData properties from your curves, following solution (1) from @ephsmith, and set them back. Here is an example for one curve.
y = evrnd(0,3,100,1); %# random data
%# original data
subplot(1,2,1)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
%# reduced data
subplot(1,2,2)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
xdata = get(h,'XData');
ydata = get(h,'YData');
set(h, 'XData',xdata(1:5:end), 'YData',ydata(1:5:end)); % set both at once so the lengths stay consistent
Another method is to calculate the empirical CDF separately using the ECDF function, then reduce the results before plotting with PLOT.
y = evrnd(0,3,100,1); %# random data
[f, x] = ecdf(y);
%# original data
subplot(1,2,1)
plot(x,f,'*')
%# reduced data
subplot(1,2,2)
plot(x(1:5:end),f(1:5:end),'r*')
Result
I know this is potentially unnecessary given MATLAB's built-in functions (in the Statistics Toolbox anyway) but it may be of use to other viewers who do not have access to the toolbox.
The empirical CMF (CDF) is essentially the cumulative sum of the empirical PMF. The latter is attainable in MATLAB via the hist function. In order to get a nice approximation to the empirical PMF, the number of bins must be selected appropriately. In the following example, I assume that 64 bins is good enough for your data.
%# compute a histogram with 64 bins for the data points stored in y
[f,x]=hist(y,64);
%# convert the frequency points in f to proportions
f = f./sum(f);
%# compute the cumulative sum of the empirical PMF
cmf = cumsum(f);
Now you can choose how many points you'd like to plot by using the reduced data example given by yuk.
n = 20;                               % number of data markers in the curve
M_n = round(linspace(1,numel(x),n));  % indices of the markers
% plot the whole line, and markers for selected data points
plot(x,cmf,'b-',x(M_n),cmf(M_n),'rs')
Very simple: try reducing the marker size.
x = rand(10000,1);
y = x + rand(10000,1);
plot(x,y,'b.','markersize',1);
For publishing purposes I tend to use the plot tools on the figure window. This allows you to tweak all of the plot parameters and immediately see the result.
If the problem is that you have too many data points, you can:
1). Plot using every nth sample of the data. Experiment to find an n that results in the look you want.
2). I typically fit curves to my data and add a few sparsely placed markers to plots of the fits to differentiate the curves.
Honestly, for publishing purposes I have always found that choosing different 'LineStyle' or 'LineWidth' properties for the lines gives much cleaner results than using different markers. This would also be a lot easier than trying to downsample your data, and for plots made with CDFPLOT I find that markers simply occlude the stairstep nature of the lines.
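As a minimal sketch of that LineStyle/LineWidth approach (the data here is made up purely for illustration):
y1 = randn(100,1);            % hypothetical dataset 1
y2 = randn(100,1) + 1;        % hypothetical dataset 2
h1 = cdfplot(y1); hold on
h2 = cdfplot(y2); hold off
set(h1, 'LineStyle', '-',  'LineWidth', 1.5)
set(h2, 'LineStyle', '--', 'LineWidth', 1.5)
legend([h1 h2], 'dataset 1', 'dataset 2', 'Location', 'southeast')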
I have 2 800x1 arrays in Matlab which contain my amplitude vs. frequency data, one array contains the magnitude, the other contains the corresponding values for frequency. I want to find the frequency at which the amplitude has reduced to half of its maximum value.
What would be the best way to do this? I suppose my two main concerns are: if the 'half amplitude' value lies between two data points, how can I find it? (e.g. if the value I'm looking for is 5, how can I "find it in my data" if it lies between two data points such as 4 and 6?)
and if I find the 'half amplitude' value, how do I then find the corresponding value for frequency?
Thanks in advance for your help!
You can find the index near your point of interest by doing
idx = magnitudes >= (max(magnitudes)/2);
And then you can see all the corresponding frequencies, including the peak, by doing
disp(frequencies(idx))
You can add more conditions to the idx calculation if you want to see less extraneous stuff.
However, your concern about finding the exact frequency is harder to answer. It will depend heavily on the nature of the signal and also on the lineshape of your window function. In general, you might be better off trying to characterize your peak with a few points and then doing a curvefit of some kind. Are you trying to calculate Q of a resonant filter, by any chance?
If it's OK for your application, you can do simple linear interpolation: find the segments where the drop occurs and calculate intermediate values. This will be no good if you expect noise in the signal.
idx = find(magnitudes(2:end) <= (max(magnitudes)/2) & ...
magnitudes(1:end-1) >= (max(magnitudes)/2));
mag1 = magnitudes(idx); % magnitudes of points before drop
mag2 = magnitudes(idx+1); % magnitudes of points after drop below max/2
fr1 = frequencies(idx); % frequencies just before drop
fr2 = frequencies(idx+1); % frequencies after drop below max/2
magx = max(magnitudes)/2; % max/2
frx = (magx-mag2).*(fr1-fr2)./(mag1-mag2) + fr2; % estimated frequencies
You can also use the INTERP1 function; a sketch follows.
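For example, something along these lines (a sketch; it assumes the magnitude decreases monotonically after the peak, which real, noisy data may violate, in which case interp1 will complain about repeated sample points):
halfMax = max(magnitudes)/2;
[~, pk] = max(magnitudes);                          % index of the peak
% interpolate frequency as a function of magnitude on the falling edge
frx = interp1(magnitudes(pk:end), frequencies(pk:end), halfMax);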