I have a simple multiple linear regression in Python that looks like this:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(x_cols, df['Volume'], test_size=0.15)
regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
How do I plot the residuals of this model?
At first I tried this:
sns.residplot(y_pred, y_test)
But I'm not sure whether this is actually displaying the residuals of the linear regression. Am I passing the right arguments to residplot?
No, you need to pass your x and y data as arguments; residplot fits the regression itself and plots the residuals.
You can read more about residplot in the seaborn documentation. For example:
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    'X': np.random.randn(60),
    'Y': np.random.randn(60),
})
sns.residplot(x='X', y='Y', data=df)
I have 7 classes within my training examples (labeled 1-7). I'm running logistic regression and I want to create an ROC curve for each of my classes.
To train my model and make a prediction, I have the following code:
Theta = zeros(k, n+1); % initialize theta
[Theta, costs] = gradientDescent(Theta, @(t)(CostFunc(t, X, Y, lambda)), ...
    @(t)(DerivOfCostFunc(t, X, Y, lambda)), alpha, iter_num);
%Make prediction with trained model
[scores,prediction] = predict(Theta, X_test); %X_test is the design matrix (ones on the first col)
Within the predict script, I have
scores = g(X*all_theta'); %this is the sigmoid function
[p_max, IndexOfMax]=max(scores, [], 2);
prediction = IndexOfMax;
Note that scores is an m-by-k matrix, where m is the number of training examples and k is the number of classes. prediction is an m-by-1 vector with values from 1 to 7, giving the predicted class.
To create the ROC curve for class 3, for example:
classNum = 3;
for i = 1:size(scores,1)
    temp = scores(i,:);
    diffscore(i,:) = temp(classNum) - max([temp(:,1:classNum-1), temp(:,classNum+1:end)]);
end
I did this last part because I read that I had to treat class 3 as the positive class and all the others as negative.
Finally, I made the curve with the following code:
[xROC,yROC,~,auc] = perfcurve(y_test,diffscore,classNum);
%y_test contains my true labels, m by 1 column vector
However, when I run the ROC curve for each of my classes, I get the same plot for all of them, and they all have an AUC of 1. Based on some analysis, I know this is not correct, but I can't figure out where in the code I went wrong. Is there additional code I should add, or do I need to modify any of my existing code?
I have to implement an SVM classifier that recognizes labels.
The code is the following:
function [Y_SVM_test] = getSVM(x, y, z, labels)
%matrix that contain x,y,z
X = [];
%vector of labels
Y = [];
X = [X; x y z];
Y = [Y; labels];
cv = cvpartition(length(X),'holdout',0.2);
% Training set
Xtrain = X(training(cv),:);
Ytrain = Y(training(cv));
% Test set
Xtest = X(test(cv),:);
Ytest = Y(test(cv));
tic
mySVM = fitcecoc(Xtrain,Ytrain);
toc
Y_SVM_test = predict(mySVM,Xtest);
end
With the function fitcecoc the execution never ends. Am I using it incorrectly? I also tried the function fitcsvm, which seems more specific from the documentation, but then I get the following error: Error using ClassificationSVM.prepareData (line 686) You can not train an SVM model for more than 2 classes.
In general, I have not really understood what the best way to run an SVM in MATLAB is. Can someone help me?
Your code looks good to me. When you say it never ends, I would guess you just haven't waited long enough. If your dataset is fairly large, fitting an ECOC SVM model can take a long time.
Using fitcecoc is the right way to fit a multiclass SVM model. An SVM by itself is a two-class model, which is what fitcsvm fits. To fit a multiclass model, a wrapper is needed, and ECOC is such a wrapper: it decomposes the multiclass problem into a set of two-class problems (for example, one class against all the others, or one pair of classes at a time) and fits a binary SVM for each of them. That's why it can take so long: it has to fit several models, not just one.
PS: you don't need X = []; followed by X = [X; x y z];. Just write X = [x y z]; it has the same effect. Similarly, just write Y = labels;.
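If you want to see what fitcecoc is doing while it runs (and confirm it has not hung), here is a minimal sketch of a more explicit call, assuming the Statistics and Machine Learning Toolbox; the template settings are only an illustration, not something your data requires:

t = templateSVM('Standardize', true);       % standardize predictors before each binary fit
mySVM = fitcecoc(Xtrain, Ytrain, ...
    'Learners', t, ...
    'Coding', 'onevsall', ...               % one binary learner per class instead of per pair
    'Verbose', 2);                          % print progress for each binary learner
Y_SVM_test = predict(mySVM, Xtest);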
I would like to implement logistic regression in MATLAB. I have the following code for this:
function B=logistic_regression(x,y)
f=#(a)(sum(y.*log((exp(a(1)+a(2)*x)/(1+exp(a(1)+a(2)*x))))+(1-y).*log((1-((exp(a(1)+a(2)*x)/(1+exp(a(1)+a(2)*x))))))));
a=[0.1, 0.1];
options = optimset('PlotFcns',#optimplotfval);
B = fminsearch(f,a, options);
end
Logistic regression works as follows:
First we calculate the logit, which is equal to
L=b0+b1*x
then we calculate the probability, which is equal to
p=e^L/(1+e^L)
and finally we calculate
y*ln(p)+(1-y)*ln(1-p)
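Written out over all m observations, the quantity being maximized is the log-likelihood (shown here in LaTeX just to make the sum explicit; note that fminsearch minimizes, so in practice one would pass it the negative of this):

\ell(b_0, b_1) = \sum_{i=1}^{m} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right],
\qquad p_i = \frac{e^{b_0 + b_1 x_i}}{1 + e^{b_0 + b_1 x_i}}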
I decided to write all of this in one line, but when I run the code, it gives me the following error:
>> B=logistic_regression(x,y)
Assignment has more non-singleton rhs dimensions than non-singleton subscripts
Error in fminsearch (line 200)
fv(:,1) = funfcn(x,varargin{:});
Error in logistic_regression (line 6)
B = fminsearch(f,a, options);
How can I fix this problem? Thanks in advance.
In order to fit a logistic regression model, I usually call the glmfit function, which is the simplest way to go. The syntax is:
b = glmfit(x,y,'binomial','link','logit');
b is a vector that contains the coefficients for the linear part of the logistic regression (the first element is the intercept term). x contains the predictor data, with one row per observation and one column per variable. y contains the target variable, usually a vector of boolean (0 or 1) values representing the outcome.
Once you obtain the coefficients, you have to apply the linear part of the regression to your predictors:
z = b(1) + (x * b(2));
To finish, you must apply the logistic function to the output of the linear part:
z = 1 ./ (1 + exp(-z));
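Putting those pieces together, here is a minimal end-to-end sketch; the data is made up purely to illustrate the calls, and it assumes a single predictor as in the snippet above:

x = randn(100, 1);                              % one predictor, 100 observations (made-up data)
y = double(x + 0.3*randn(100, 1) > 0);          % 0/1 outcome loosely related to x (made-up data)
b = glmfit(x, y, 'binomial', 'link', 'logit');  % fit the logistic regression
z = b(1) + x * b(2);                            % linear part: intercept plus slope times x
p = 1 ./ (1 + exp(-z));                         % predicted probabilities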
If you need to do more tinkering with your data or your output, and you require more flexibility and control over your model, I suggest you look at this implementation:
https://github.com/mohammadaltaleb/Logistic-Regression
I want to write a bimodal probability density function (a PDF with multiple peaks, Galtung S) without using the pdf function from the Statistics Toolbox. Here is my code:
x = 0:0.01:5;
d = [0.5;2.5];
a = [12;14]; % scale parameter
y = 2*a(1).*(x-d(1)).*exp(-a(1).*(x-d(1)).^2) + ...
2*a(2).*(x-d(2)).*exp(-a(2).*(x-d(2)).^2);
plot(x,y)
Here's the curve.
I would like to change the mathematical formula to get rid of the dips in the curve that appear at approximately 0 < x < 0.5 and 2 < x < 2.5.
Is there a way to implement x > d(1) and x > d(2) in line 4 of the code to avoid y < 0? I would not want to solve this with a loop, because I need to convert the formula to a CDF later on.
If you want to plot only for x > max(d(1), d(2)), you can use logical indexing:
plot(x(x>max(d)),y(x>max(d)))
If you want to plot for all x, but clip the curve at zero by plotting max(y,0), you can just write:
plot(x,max(y,0))
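If you instead want to build the x > d(1) and x > d(2) conditions into the formula itself, as the question asks, one possible sketch (my own suggestion, not part of the answer above) is to multiply each term by a logical factor, which stays vectorized and carries over to the CDF later:

y = 2*a(1).*(x-d(1)).*exp(-a(1).*(x-d(1)).^2).*(x > d(1)) + ...
    2*a(2).*(x-d(2)).*exp(-a(2).*(x-d(2)).^2).*(x > d(2));   % each term contributes only where x exceeds its own d
plot(x, y)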
Does MATLAB have any built-in function to evaluate the density of a random variable from a custom histogram? (I suspect there are probably lots of ways to do this; I am just looking to see if there is already any built-in MATLAB functionality.)
Thanks.
The function hist gives you a binned approximation of the probability density you are evaluating.
If you want a continuous representation of it, this article from the MATLAB documentation explains how to get one using the spline command from the Curve Fitting Toolbox. Basically, the article explains how to build a cubic spline interpolation of your histogram.
The resulting code is:
y = randn(1,5001);                 % replace y with your own dataset
[heights,centers] = hist(y);       % bin counts and bin centers
hold on
n = length(centers);
w = centers(2) - centers(1);       % bin width
t = linspace(centers(1)-w/2, centers(end)+w/2, n+1);   % bin edges
dt = diff(t);
Fvals = cumsum([0, heights.*dt]);  % cumulative area under the histogram up to each edge
F = spline(t, [0, Fvals, 0]);      % cubic spline through the cumulative values (zero slope at the ends)
DF = fnder(F);                     % its derivative is a smooth curve following the histogram heights
fnplt(DF, 'r', 2)                  % plot the smoothed density estimate
hold off
ylims = ylim;
ylim([0, ylims(2)]);               % keep the y-axis non-negative
A popular way is to use kernel density estimation. The simplest way to do this in MATLAB is using ksdensity.
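For instance, a minimal sketch (the data here is just a stand-in for your own sample):

y = randn(1, 5001);        % replace with your own dataset
[f, xi] = ksdensity(y);    % kernel density estimate evaluated on a default grid of points
plot(xi, f)                % continuous density curve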