95% confidence interval for AUROCC? - confidence-interval

How do I calculate the 95% confidence interval of the area under the ROC curve (AUROCC) on BlueSky Statistics? I know how to create the multivariate logistic model and show the ROC curve and AUROCC. I tried using the Bootstrap Resampling but could not figure out how to get the 95% confidence interval.

If you are familiar with R, you may want to enter the corresponding R code to perform the calculation you are looking for (see chapter 4 of the user manual for V10 for further details on using R code with BlueSky).
For example, I was able to use R code to connect directly to MySQL as well as to open a SQLite database file.

Related

plot 95% confidence interval of 20 data points in Matlab

If the number of data points is small (e.g., 20 data points), do I need to check for normality before calculating a confidence interval?
Can anyone suggest a rigorous process for plotting a 95% confidence interval?
Thanks!
With so few data points, and if you don't know whether the distribution is normal, I would look into using a bootstrap confidence interval. It's a non-parametric method, so you are not assuming normality. The MATLAB function bootci implements this method.
Here is the documentation for bootci.
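As a minimal sketch, assuming you want a confidence interval for the mean (swap in median or any other statistic), bootci could be used roughly like this; the vector x below is just a stand-in for your 20 data points:

x = randn(20,1);                                 % stand-in for your 20 data points
nboot = 5000;                                    % number of bootstrap resamples
ci = bootci(nboot, {@mean, x}, 'Type', 'per');   % percentile bootstrap 95% CI
m = mean(x);                                     % point estimate

figure; hold on
plot(1:numel(x), x, 'o')                         % the raw data points
yline(m, 'r-')                                   % point estimate
yline(ci(1), 'r--'); yline(ci(2), 'r--')         % 95% confidence limits
hold off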

Plotting Acoustic Waveform - Magnitude on a Linear Frequency Scale

I have an acoustic waveform of a Spanish phoneme and I'd like to compute its magnitude spectrum and plot it in dB magnitude on a linear frequency scale. How would I be able to accomplish this in MATLAB?
Thanks
First, a quick heads-up: on Stack Overflow you are expected to show some of your own effort to solve the problem and then ask for help.
Now to your question:
You can plot the spectrogram using the "spectrogram" MATLAB function:
[s,f,t] = spectrogram(x,window,noverlap,nfft,fs)
Check the details here: https://www.mathworks.com/help/signal/ref/spectrogram.html
For a speech signal you will want to specify the sampling frequency "fs"; you can get that when you read the file using:
[y,Fs] = audioread(filename)
You will probably want to specify the variables "window" and "noverlap", since speech signals can show distinct properties depending on the window size (fast phenomena will not be visible with big windows). Typical values are 20 ms windows with 10 ms overlap (select the exact lengths by considering your sampling frequency and the nearest 2^n value for fast Fourier calculation).
The window size and overlap also matter when you calculate the spectrum. If you apply the FFT to the whole waveform, you will get the "average" spectral information for the sentence. To catch specific phenomena you must use windowing techniques and perform a short-term Fourier analysis.
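As a rough sketch of both options (the file name is hypothetical, and the 20 ms / 10 ms values are just the typical choices mentioned above): a spectrogram call for short-term analysis, and a whole-waveform magnitude spectrum in dB on a linear frequency axis:

[y, Fs] = audioread('phoneme.wav');       % hypothetical file name
y = y(:,1);                               % first channel if the file is stereo

% Short-term analysis: 20 ms windows with 10 ms overlap
win = round(0.020*Fs);
nov = round(0.010*Fs);
nfft = 2^nextpow2(win);
spectrogram(y, hamming(win), nov, nfft, Fs, 'yaxis')

% Whole-waveform magnitude spectrum in dB on a linear frequency axis
N = 2^nextpow2(length(y));
Y = fft(y, N);
f = (0:N/2-1) * Fs / N;                   % one-sided, linear frequency axis
figure
plot(f, 20*log10(abs(Y(1:N/2)) + eps))    % eps avoids taking log of zero
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)')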
Alternatively, use sptool from the Signal Processing Toolbox; see its documentation.

About conversion to dB

So I have been dealing with hydrophone data and coding in MATLAB. I have a couple of questions:
1. I read all the .wav files into MATLAB and compute the power spectral density using pwelch, which I am able to do with ease. My question is about the conversion to dB. The hydrophone output is in volts per micropascal. So, to obtain the power spectral density values in dB, do I simply compute 120 + 10*log10(Power)? This is what I did; the Power here is the output of pwelch, and the values are on the order of 10^-6.
2. How do I get sound pressure levels (SPL, dB re 1 micropascal) from the power spectral density values in dB?
3. How do I choose a specific frequency band of interest? Say I want to look at the power spectral density from 10 to 12 kHz; how do I go about that?
Many thanks.
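A minimal sketch of the conversion described above, assuming a hydrophone sensitivity of -120 dB re 1 V/µPa (which is where the "+120" comes from) and samples already scaled to volts (note that audioread returns normalized samples, so you may need to multiply by the recorder's full-scale voltage first):

[v, Fs] = audioread('hydrophone.wav');      % hypothetical file name
sens_dB = -120;                             % assumed sensitivity, dB re 1 V/uPa

nwin = 2^12;
[Pxx, f] = pwelch(v, hann(nwin), nwin/2, nwin, Fs);   % PSD in V^2/Hz

% 1. PSD in dB re 1 uPa^2/Hz: remove the sensitivity
psd_dB = 10*log10(Pxx) - sens_dB;           % same as 120 + 10*log10(Pxx) here

% 2./3. Band level over 10-12 kHz: integrate the linear PSD, then take dB
band = f >= 10e3 & f <= 12e3;
df = f(2) - f(1);
Pband = sum(Pxx(band)) * df * 10^(-sens_dB/10);   % uPa^2 in the band
SPL_band = 10*log10(Pband);                 % band SPL in dB re 1 uPa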

Beginner's issue in polynomial curve fitting [Part 1]

I have just started learning regression-based modeling techniques and have been going through the MATLAB Curve Fitting Toolbox and SO. I have some fundamental doubts and am unable to proceed further. I have a single vector with k = 100 data points which I want to fit to an AR model, an MA model, and an ARMA model in turn, to see which is best suited. Starting with an AR(p) model of the form y(k+1) = a*y(k) + b*y(k-1), the command
coeff = polyfit(x,y,d)
will fit a polynomial of degree d (say d = 1) with p coefficients indicating the order of the model (AR(p)). But I just have one set of data, the recording of the angular moment. So what goes in as the first parameter (x) of the function signature, i.e., what are x and y? And what if the linear models are not good enough, so I have to select a nonlinear model? Can somebody please show, with code snippets, the steps for fitting, checking for overfitting, residual calculation, etc.?
x is likely to be k (the index of y). The whole call:
c = polyfit(1:length(y), y, d)
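A minimal sketch along those lines, assuming y is a column vector holding the k = 100 recorded values; the second part shows how the AR(2) relation y(k+1) = a*y(k) + b*y(k-1) from the question could instead be fit directly by least squares on lagged values:

% Polynomial fit of y against its sample index
y = y(:);                              % your recorded values, as a column
k = (1:length(y))';
d = 1;                                 % polynomial degree
c = polyfit(k, y, d);
yhat = polyval(c, k);                  % fitted values
res = y - yhat;                        % residuals (inspect these for structure)

% Alternative: fit the AR(2) relation by least squares on lagged values
Y = y(3:end);                          % targets y(k+1)
X = [y(2:end-1), y(1:end-2)];          % regressors [y(k), y(k-1)]
ab = X \ Y;                            % [a; b] estimated by least squares
res_ar = Y - X*ab;                     % AR residuals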
MATLAB has a Curve Fitting Toolbox. You could use its GUI to try different nonlinear fits and get some intuition.
If you want the steps, there's a great Coursera Machine Learning course. The beginning of that course covers linear regression, and I recommend spending at least a few hours on it.

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = @(z)crossval('mcr',cdata,grp,'Predfun', ...
    @(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
    xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I would appreciate it if you could give me the code for this section. Thanks.
I would also like to know what it means when it says exp([rbf_sigma,boxconstraint]).
rbf_sigma: The SVM uses a Gaussian kernel, and rbf_sigma sets the standard deviation (roughly, the size) of that kernel. To understand how kernels work: the SVM places a kernel around every sample (so you have a Gaussian around every sample). The kernels are then summed over the samples of each category/type. At each point, the type whose sum is higher is the "winner". For example, if type A has a higher sum of these kernels at point X, then a new datum at point X will be classified as type A. (There are other configuration parameters that may change the actual threshold at which one category is selected over another.)
Fig.: see the figure on the page you linked. You can see how, by adding up the Gaussian kernels on the red samples ("sumA") and on the green samples ("sumB"), it is logical that sumA > sumB in the center part of the figure, and that sumB > sumA in the outer part of the image.
boxconstraint: a cost/penalty on misclassified data. During the training stage, where you use the training data to adjust the SVM parameters, the training algorithm uses an error function to decide how to optimize the SVM parameters iteratively. The cost for a misclassified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure referenced above, that boundary is the inner blue circumference.
Taking into account BGreene's indications and what I understand of the tutorial:
In the tutorial they advise trying values for rbf_sigma and boxconstraint that are exponentially spaced. This means you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly spaced). They advise this so that you try a wide range of values first; you can further optimize later, starting FROM the first optimum you obtained.
The command [searchmin, fval] = fminsearch(minfn,randn(2,1),opts) will give you back the optimum values (take exp(searchmin) to recover rbf_sigma and boxconstraint). You probably have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum: with exp(z(1)) in the definition of minfn, fminsearch effectively takes 'exponentially' big steps.
In machine learning, always keep in mind that there are three subsets of your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. The cross-validation set is then used to select the optimum values of rbf_sigma and boxconstraint. Finally, the test data is used to get an idea of the performance of your classifier (the efficiency of the classifier is determined on the test set).
So, if you start with 10000 samples you might divide the data, for example, as training (50%), cross-validation (25%), test (25%): randomly sample 5000 samples for the training set, then 2500 of the remaining 5000 for the cross-validation set, and set aside the rest (2500) for the test set.
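For illustration only (this split is not part of the tutorial), such a partition could be drawn with permuted indices:

n = 10000;                           % total number of samples, as in the example
idx = randperm(n);                   % random permutation of 1..n
trainIdx = idx(1:5000);              % 50% training
cvIdx    = idx(5001:7500);           % 25% cross-validation
testIdx  = idx(7501:10000);          % 25% test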
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval: produces the classification error based on cross-validation using partition c
- crossfun: classifies data using an SVM
- fminsearch: optimizes the SVM hyperparameters to minimize classification error
Hope this helps
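To make that concrete, here is a minimal sketch of a crossfun and the surrounding optimization. The tutorial itself uses the older svmtrain/svmclassify functions; this sketch substitutes fitcsvm/predict, and treating 'KernelScale' as the counterpart of rbf_sigma is an assumption, so take it as an illustration rather than the tutorial's exact code:

% cdata: n-by-p feature matrix, grp: n-by-1 labels, as in the tutorial
c = cvpartition(numel(grp), 'KFold', 10);            % cross-validation folds

% minfn maps z = log([rbf_sigma, boxconstraint]) to the cross-validated
% misclassification rate; exp() keeps both parameters positive
minfn = @(z) crossval('mcr', cdata, grp, 'Predfun', ...
    @(xtrain, ytrain, xtest) crossfun(xtrain, ytrain, xtest, ...
                                      exp(z(1)), exp(z(2))), ...
    'partition', c);

opts = optimset('TolX', 5e-4, 'TolFun', 5e-4);
[searchmin, fval] = fminsearch(minfn, randn(2,1), opts);
best = exp(searchmin);    % best(1) ~ rbf_sigma, best(2) ~ boxconstraint

% crossfun goes at the end of the script (or in its own crossfun.m file)
function yfit = crossfun(xtrain, ytrain, xtest, rbf_sigma, boxconstraint)
% Train an RBF-kernel SVM on the training fold and classify the test fold
mdl  = fitcsvm(xtrain, ytrain, 'KernelFunction', 'rbf', ...
               'KernelScale', rbf_sigma, 'BoxConstraint', boxconstraint);
yfit = predict(mdl, xtest);
end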