Why is linear regression wrong for pyspark?

I kept getting wrong answers, so I tried it on something very, very basic, and it was still wrong.
Input file:
1 1:1
2 1:2
3 1:3
4 1:4
from pyspark.ml.regression import LinearRegression
# Load training data
training = spark.read.format("libsvm").load("stupid.txt")
lr = LinearRegression(maxIter=100, regParam=0.3, loss='squaredError')
# Fit the model
lrModel = lr.fit(training)
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
I should have gotten coefficients [1] and intercept 0. Instead I got:
Coefficients: [0.7884394856681294]
Intercept: 0.52890128583

It looks like the issue is the regParam parameter you're using. If I run it with regParam set to 0, which makes this ordinary least squares (OLS), we get the expected output:
Code:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
training = spark.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (2.0, Vectors.dense(2.0)),
    (3.0, Vectors.dense(3.0)),
    (4.0, Vectors.dense(4.0))], ["label", "features"])
lr = LinearRegression(maxIter=100, regParam=0, loss='squaredError')
# Fit the model
lrModel = lr.fit(training)
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(lrModel.coefficients))
print("Intercept: %s" % str(lrModel.intercept))
# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)
Output:
Coefficients: [1.0]
Intercept: 0.0
numIterations: 1
objectiveHistory: [0.0]
+---------+
|residuals|
+---------+
| 0.0|
| 0.0|
| 0.0|
| 0.0|
+---------+
RMSE: 0.000000
r2: 1.000000
The cause is that regParam > 0, combined with the default elasticNetParam of 0, adds a pure L2 (ridge) penalty to the objective, which deliberately shrinks the coefficients toward zero and pulls the fit away from the exact OLS solution.
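To see the mechanism in isolation, here is a minimal numpy sketch (no Spark) of the same ridge objective, 1/(2n)*sum((w*x + b - y)^2) + (lambda/2)*w^2, solved in closed form. Spark standardizes features internally, so its exact numbers differ slightly, but the shrinkage pattern is the same:
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])
for lam in [0.0, 0.3]:
    sxx = np.mean((x - x.mean()) ** 2)              # feature variance
    sxy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance with the label
    w = sxy / (sxx + lam)        # slope shrinks as lam grows
    b = y.mean() - w * x.mean()  # intercept compensates upward
    print("lam=%.1f -> w=%.4f, b=%.4f" % (lam, w, b))
This prints w=1.0000, b=0.0000 for lam=0.0 and w=0.8065, b=0.4839 for lam=0.3: the same qualitative shift (slope below 1, intercept above 0) that Spark produced.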

Related

How to split the dataset in a stratified way (Matlab)?

I have a 30x500 data matrix and a 3x500 target matrix. This is a classification problem. I need to divide the data into training, validation and testing (80%, 10%, 10%), but I want to maintain the proportion of each class in the divided data. How can I do this in Matlab?
Edit:
The target matrix contains the labels (one hot) of the correct class (there are three classes)
|0 0 1 ... 1|
|1 0 0 ... 0|
|0 1 0 ... 0|  (3x500)
The data matrix contains 500 samples with 30 predictor variables (30x500).
|2 0 1 4 8 1 ... 2|
|4 1 5 8 7 3 ... 0|
|1 3 6 4 2 1 ... 6|
|. . . . . . . . .|
|3 5 8 4 0 0 ... 1|  (30x500)
You can impose the number of samples in each class by computing the percentiles associated with the cumulative class probabilities, and then assigning each datapoint to the interval it falls into. Let's use this strategy to first create a random dataset containing exactly 40% of class 1, 20% of class 2, and 40% of class 3 (note that you don't have to do this in your case, as you already have the class matrix classmat and data matrix datamat):
clc; clear all; close all;
% Parameters
c = 3; % number of classes
d = 500; % number of datapoints
o = 30; % number of predictor variables per datapoint
propd = [0.4, 0.2, 0.4]; % proportions of each class in the original data (size 1xc)
% Generation of fake data
datamat = randi([0,15],o,d); % test data matrix
propd_os = cumsum([0, propd]); propd_os(end) = 1;
randmat = rand(d,1);
classmat = zeros(c,d); % test class matrix
for i=1:c
    prctld = prctile(randmat, 100*propd_os); prctld(1) = 0; prctld(end) = 1;
    classmat(i,randmat>=prctld(i) & randmat<prctld(i+1)) = 1;
end
% Proportions of the original data
disp(['original data proportions:' sprintf(' %.3f',sum(classmat,2)/d)])
When executed, this code creates the classmat matrix and displays the proportions of each class in this matrix:
>>>
original data proportions: 0.400 0.200 0.400
Here is a script that splits this dataset into parts while respecting the class proportions of the original dataset:
%% Splitting parameters
s = 3; % number of parts
props = [0.8, 0.1, 0.1]; % proportions of each split dataset (size 1xs)
props_os = cumsum([0, props]); props_os(end) = 1;
randmat = rand(1,d);
splitmat = zeros(s,d); % split matrix (one hot)
for i=1:c
    indc = classmat(i,:)==1;
    prctls = prctile(randmat(indc), 100*props_os); prctls(1) = 0; prctls(end) = 1;
    for j=1:s
        inds = randmat>=prctls(j) & randmat<prctls(j+1);
        splitmat(j,indc&inds) = 1;
    end
end
% Proportions of the split parts, and class proportions within each part
disp(['original split proportions:' sprintf(' %.3f',sum(splitmat,2)/d)])
for j=1:s
    inds = splitmat(j,:)==1;
    disp([sprintf('part %d proportions:', j) sprintf(' %.3f',sum(classmat(:,inds),2)/sum(inds))])
end
With this you obtain 3 parts containing 80%, 10% and 10% of the data. Each part has the same proportion of each class as the original dataset:
>>>
original split proportions: 0.800 0.100 0.100
part 1 proportions: 0.400 0.200 0.400
part 2 proportions: 0.400 0.200 0.400
part 3 proportions: 0.400 0.200 0.400
Note that you cannot always obtain the exact proportions, because that depends on the size of the datasets and their divisibility by the inverses of the proportions. But I think it should do what you want. Please do not hesitate to ask questions if you have trouble with the code. At the end you obtain a one-hot splitmat matrix: splitmat(s,d) equals 1 if datapoint d belongs to part s.
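For reference, if Python is ever an option for the same task, scikit-learn's train_test_split with the stratify argument does this class-proportional splitting directly. A minimal sketch under that assumption (the random X and y here are stand-ins for your datamat and class labels):
import numpy as np
from sklearn.model_selection import train_test_split
X = np.random.rand(500, 30)  # 500 samples, 30 predictors
y = np.random.choice([0, 1, 2], size=500, p=[0.4, 0.2, 0.4])  # class labels
# 80% train, then split the remaining 20% in half (10% validation, 10% test)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.2, stratify=y)
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, stratify=y_rest)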

How to do nonlinear data-fitting of a function on experimental data

I have some experimental data, and I need to fit the following function to it to determine one of the variables. A Levenberg–Marquardt least-squares algorithm was used in this procedure.
I have used the curve-fitting option in the Igor Pro software. I defined a new fit function and tried to define the independent and dependent variables.
Nevertheless, I don't know why I get this error:
"The fitting function returned INF for at least one X variable"
My function is:
sin(theta) = -1+2*sqrt(alpha/x)*exp(-beta*(x-alpha)^2)
beta = 1.135e-4;
sin(theta) = [-0.81704 -0.67649 -0.83137 -0.73468 -0.66744 -0.43602 0.45368 0.75802 0.96705 0.99717 ]
x = [72.01 59.99 51.13 45.53 36.15 31.66 30.16 29.01 25.62 23.47 ]
Are there any suggestions for finding the alpha variable here?
Is there any handy software or program for nonlinear curve fitting?
In gnuplot, it would look like this. The fit is not great, but that's not the "fault" of gnuplot; apparently this data simply cannot be fitted very well with this function.
Code:
### nonlinear curve fitting
reset session
$Data <<EOD
72.01 -0.81704
59.99 -0.67649
51.13 -0.83137
45.53 -0.73468
36.15 -0.66744
31.66 -0.43602
30.16 0.45368
29.01 0.75802
25.62 0.96705
23.47 0.99717
EOD
f(x) = -1+2*sqrt(alpha/x)*exp(-beta*(x-alpha)**2)
# initial guessed values
alpha = 25
beta = 1
set fit nolog results
fit f(x) $Data u 1:2 via alpha,beta
plot $Data u 1:2 w lp pt 7, \
     f(x) lc rgb "red"
print sprintf("alpha=%g, beta=%g",alpha,beta)
### end of code
Result:
alpha=25.818, beta=0.0195229
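As a cross-check, the same fit can be reproduced in Python with scipy (a sketch under assumed tooling, seeded near gnuplot's solution; note that from a poor starting guess the exp() term can overflow or underflow, which is a plausible source of the INF error Igor Pro reported):
import numpy as np
from scipy.optimize import curve_fit
x = np.array([72.01, 59.99, 51.13, 45.53, 36.15, 31.66, 30.16, 29.01, 25.62, 23.47])
y = np.array([-0.81704, -0.67649, -0.83137, -0.73468, -0.66744,
              -0.43602, 0.45368, 0.75802, 0.96705, 0.99717])
def f(x, alpha, beta):
    return -1.0 + 2.0*np.sqrt(alpha/x)*np.exp(-beta*(x-alpha)**2)
# unbounded curve_fit uses Levenberg-Marquardt, like the gnuplot fit above
popt, _ = curve_fit(f, x, y, p0=[25.0, 0.02])
print(popt)  # should land near gnuplot's alpha ~ 25.8, beta ~ 0.0195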
If it might be of some use, my equation search on your data turned up a good fit to a standard 4-parameter logistic equation, y = d + (a - d) / (1.0 + pow(x / c, b)), with parameters a = 0.96207949, b = 44.14292256, c = 30.67324939, and d = -0.74830947, yielding RMSE = 0.0565 and R-squared = 0.9943. I have included code for a Python graphical fitter using this equation.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
theta = [-0.81704, -0.67649, -0.83137, -0.73468, -0.66744, -0.43602, 0.45368, 0.75802, 0.96705, 0.99717]
x = [72.01, 59.99, 51.13, 45.53, 36.15, 31.66, 30.16, 29.01, 25.62, 23.47]
# rename to match previous example code
xData = numpy.array(x)
yData = numpy.array(theta)
# StandardLogistic4Parameter equation from zunzun.com
def func(x, a, b, c, d):
    return d + (a - d) / (1.0 + numpy.power(x / c, b))
# these are the same as the scipy defaults
initialParameters = numpy.array([1.0, 1.0, 1.0, 1.0])
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
modelPredictions = func(xData, *fittedParameters)
absError = modelPredictions - yData
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(yData))
print('Parameters:', fittedParameters)
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
print()
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
    axes = f.add_subplot(111)
    # first the raw data as a scatter plot
    axes.plot(xData, yData, 'D')
    # create data for the fitted equation plot
    xModel = numpy.linspace(min(xData), max(xData))
    yModel = func(xModel, *fittedParameters)
    # now the model as a line plot
    axes.plot(xModel, yModel)
    axes.set_xlabel('X Data') # X axis data label
    axes.set_ylabel('Y Data') # Y axis data label
    plt.show()
    plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
Matlab
I slightly changed the function: the -1 was changed to -gamma, and gamma is optimized as well. (The minus sign in the exponent is also folded into beta, which is why the fitted beta comes out negative.) The code is as follows:
ydata = [-0.81704 -0.67649 -0.83137 -0.73468 -0.66744 -0.43602 0.45368...
0.75802 0.96705 0.99717 ];
xdata = [72.01 59.99 51.13 45.53 36.15 31.66 30.16 29.01 25.62 23.47 ];
sin_theta = @(alpha, beta, gamma, xdata) -gamma+2.*sqrt(alpha./xdata).*exp(beta.*(xdata-alpha).^2);
% Fitting function as function of a parameter array x, as required by lsqcurvefit
f = @(x,xdata) sin_theta(x(1),x(2),x(3),xdata);
% [alpha, beta, gamma]
x0 = [25, 0, 1] ;
options = optimoptions('lsqcurvefit','Algorithm','levenberg-marquardt', 'FunctionTolerance', 1e-30);
[x,resnorm,residual,exitflag,output] = lsqcurvefit(f,x0,xdata,ydata,[], [], options);
% Accuracy
RMSE = sqrt(sum(residual.^2)/length(residual));
alpha = x(1); beta = x(2); gamma = x(3);
%Plotting data
data = linspace(xdata(1),xdata(end));
plot(xdata,ydata,'ro',data,f(x,data),'b-', 'linewidth', 3)
legend('Data','Fitted exponential')
title('Data and Fitted Curve')
set(gca,'FontSize',20)
Result
alpha = 26.0582, beta = -0.0329, gamma = 0.7881 instead of 1, RMSE = 0.1498

How to achieve 0 training error when using a one-hidden-layer neural network with random inputs?

In theory, a one-hidden-layer neural network with m hidden nodes can be trained by gradient descent to fit n data points with 0 training error, where m >= n.
I have 100 data points (x, y), with x in R and y in R and no specific pattern, just random, and I was using a one-hidden-layer neural network with 1000/2000/10000/... hidden nodes to try to fit those points (with stochastic gradient descent and ReLU).
But I can't achieve 0 training error. Any idea what the problem is here?
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Activation
from keras.optimizers import SGD
from keras import initializers
## initializing x_train and y_train randomly ##
def f1(x):
    if x < 3:
        return np.abs(x-1)
    else:
        return -np.abs(x-1)+4
n = 100
x_train = np.random.uniform(-4+1, 4+1, size = n)
e = np.random.normal(0, 0.5, size = n)
y_train = np.vectorize(f1)(x_train) + e
np.random.shuffle(y_train)
k = 10000 # number of hidden nodes
ep = 5
loss = []
model = Sequential()
model.add(Dense(k, kernel_initializer = 'random_normal', input_shape = (1,), use_bias=True))
model.add(Activation('relu'))
model.add(Dense(1, kernel_initializer = 'random_normal', use_bias=True))
#sgd = SGD(lr=0.00005, decay=1e-6, momentum=0.9)
sgd = SGD(lr=0.00008)
model.compile(loss='mse', optimizer=sgd, metrics = ['mse'])
for i in range(5000):
    H = model.fit(x_train, y_train, epochs=ep, verbose=False)
    wt = model.get_weights()
    temp = H.history['mean_squared_error'][-1]
    print(temp)
    loss.append(temp)
What is your loss function? Can you show your code and perhaps some printouts of the loss per training epoch? How are you initializing the parameters of those hidden nodes? (Also, do the 1000/2000/10000/... in your description mean those are different experimental settings?)
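As a side note on the theoretical claim itself, here is a minimal numpy sketch (not the Keras setup above) showing that with m >= n random ReLU features, even fitting the output layer alone by least squares already interpolates n arbitrary points:
import numpy as np
n, m = 100, 1000
x = np.random.uniform(-3, 5, size=n)
y = np.random.normal(size=n)               # arbitrary random targets
W = np.random.normal(size=m)               # frozen random hidden weights
b = np.random.normal(size=m)               # frozen random hidden biases
H = np.maximum(0.0, np.outer(x, W) + b)    # n x m matrix of ReLU features
a, *_ = np.linalg.lstsq(H, y, rcond=None)  # fit the output layer only
print(np.max(np.abs(H @ a - y)))           # numerically zero training error
This suggests the failure above is an optimization issue (learning rate, number of updates, scaling), not a capacity issue.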

PyTorch will not fit straight line to two data points

I'm facing issues fitting a simple y = 4*x1 line with 2 data points using PyTorch. While running the inference code, the model seems to output the same value for any input, which is strange. Please find the code attached along with the data files I used. I'd appreciate any help here.
import torch
import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')
test_data = pd.read_csv('test_data.csv')
inputs = df[['x1']]
target = df['y']
inputs = torch.tensor(inputs.values).float()
target = torch.tensor(target.values).float()
test_data = torch.tensor(test_data.values).float()
#Defining Network Architecture
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net,self).__init__()
        hidden1 = 3
        # hidden2 = 5
        self.fc1 = nn.Linear(1,hidden1)
        self.fc3 = nn.Linear(hidden1,1)

    def forward(self,x):
        x = F.relu(self.fc1(x))
        x = self.fc3(x)
        return x
#instantiate the model
model = Net()
print(model)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(),lr=0.01)
model.train()
#epochs
epochs = 100
for x in range(epochs):
    #initialize the training loss to 0
    train_loss = 0
    #clear out gradients
    optimizer.zero_grad()
    #calculate the output
    output = model(inputs)
    #calculate loss
    loss = criterion(output,target)
    #backpropagate
    loss.backward()
    #update parameters
    optimizer.step()
    if ((x%5)==0):
        print('Training Loss after epoch {:2d} is {:2.6f}'.format(x,loss))
#set the model in evaluation mode
model.eval()
#Test the model on unseen data
test_output = model(test_data)
print(test_output)
Below is the model output
#model output
tensor([[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579],
[56.7579]], grad_fn=<AddmmBackward>)
Your model is collapsing; you can probably see that from the prints. You may want to use a lower learning rate (1e-5, 1e-6, etc.). Switching from SGD(...) to Adam(...) may be easier if you do not have much experience and want less trouble fine-tuning these hyperparameters. Also, maybe 100 epochs is not enough. As you did not share an MCVE, I cannot tell you for sure what the problem is. Here is an MCVE of line fitting using the same Net you used:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
epochs = 1000
max_range = 40
interval = 4
# DATA
x_train = torch.arange(0, max_range, interval).view(-1, 1).float()
x_train += torch.rand(x_train.size(0), 1) - 0.5 # small noise
y_train = (4 * x_train)
y_train += torch.rand(x_train.size(0), 1) - 0.5 # small noise
x_test = torch.arange(interval // 2, max_range, interval).view(-1, 1).float()
y_test = 4 * x_test
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        hidden1 = 3
        self.fc1 = nn.Linear(1, hidden1)
        self.fc3 = nn.Linear(hidden1, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc3(x)
        return x
model = Net()
print(model)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)
# TRAIN
model.train()
for epoch in range(epochs):
    optimizer.zero_grad()
    y_pred = model(x_train)
    loss = criterion(y_pred, y_train)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print('Training Loss after epoch {:2d} is {:2.6f}'.format(epoch, loss))
# TEST
model.eval()
y_pred = model(x_test)
print(torch.cat((x_test, y_pred, y_test), dim=-1))
This is what the data looks like (scatter plot omitted), and this is what the training looks like:
Training Loss after epoch 0 is 7416.805664
Training Loss after epoch 10 is 6645.655273
Training Loss after epoch 20 is 5792.936523
Training Loss after epoch 30 is 4700.106445
Training Loss after epoch 40 is 3245.384277
Training Loss after epoch 50 is 1779.370728
Training Loss after epoch 60 is 747.418579
Training Loss after epoch 70 is 246.781311
Training Loss after epoch 80 is 68.635155
Training Loss after epoch 90 is 17.332235
Training Loss after epoch 100 is 4.280161
Training Loss after epoch 110 is 1.170808
Training Loss after epoch 120 is 0.453974
...
Training Loss after epoch 970 is 0.232296
Training Loss after epoch 980 is 0.232090
Training Loss after epoch 990 is 0.231888
And this is what the output looks like:
| x_test | y_pred | y_test |
|:-------:|:--------:|:--------:|
| 2.0000 | 8.6135 | 8.0000 |
| 6.0000 | 24.5276 | 24.0000 |
| 10.0000 | 40.4418 | 40.0000 |
| 14.0000 | 56.3303 | 56.0000 |
| 18.0000 | 72.1884 | 72.0000 |
| 22.0000 | 88.0465 | 88.0000 |
| 26.0000 | 103.9047 | 104.0000 |
| 30.0000 | 119.7628 | 120.0000 |
| 34.0000 | 135.6210 | 136.0000 |
| 38.0000 | 151.4791 | 152.0000 |
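If tuning the SGD learning rate proves fiddly, the Adam swap suggested above is a one-line change (the learning rate here is just an assumed starting point, not a tuned value):
# hypothetical drop-in replacement for the SGD optimizer above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)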

Storing p-values using the Econometrics Toolbox in Matlab

This code will run with the Econometrics Toolbox:
model = arima('Constant',0.5,'AR',{0.9999},'Variance',.4);
rng('default')
Y = simulate(model,50);
figure
plot(Y)
xlim([0,50])
title('Simulated AR(1) Process')
rng('default')
Y = simulate(model,50,'NumPaths',1000);
Y1=Y(:,1);
for ii = 1:50
    Mdl = arima(1,0,0);
    EstMdl = estimate(Mdl, [Y(:,ii)]);
end
How can I store the p-values from EstMdl for each iteration (i.e. a vector of p-values per fitted model)?
Use summarize (requires R2018a or later) to get the results of estimate. Inside your loop, you can then preallocate a matrix and store Results.Table.PValue for each iteration; each fit yields one p-value per estimated parameter (the constant, the AR coefficient, and the variance), as the table below shows.
Showing the results for one iteration here:
>> ii=1;
>> Mdl = arima(1,0,0);
>> EstMdl = estimate(Mdl, [Y(:,ii)]);
ARIMA(1,0,0) Model (Gaussian Distribution):
                Value      StandardError    TStatistic      PValue
               _______     _____________    __________    __________
Constant        622.14        427.99          1.4536        0.14605
AR{1}          0.87561      0.085586          10.231     1.4432e-24
Variance       0.37122      0.079507           4.669     3.0263e-06
>> Results = summarize(EstMdl);
>> PValues = Results.Table.PValue
PValues =
0.1460
0.0000
0.0000