Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate a synthetic regression problem: 300 samples, 5 features
a = make_regression(n_samples=300, n_features=5, noise=5)

# Combine the features and the target into one DataFrame
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1, pd.DataFrame(a[1])], axis=1, ignore_index=True)
df1.rename(columns={0: "X1", 1: "X2", 2: "X3", 3: "X4", 4: "X5", 5: "Target"}, inplace=True)

# Plot the correlation matrix as an annotated heatmap
sns.heatmap(df1.corr(), annot=True);
(Figure: annotated heatmap of the correlation matrix for X1–X5 and Target)
Now I can ask my question: how do I choose which features to include in the model?

I am not that well-versed in Python, as I use R most of the time, but it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables, Target)

# Or simply do
model = LinearRegression().fit(Variables, Target)

# So based on the dataset head provided, it should be
X = df1[['X1', 'X2', 'X3', 'X4', 'X5']]
Y = df1['Target']
model = LinearRegression().fit(X, Y)
To do feature selection, you need to fit the model first and then check the p-values. Typically, a p-value of 5% (.05) or less is a good cut-off point: if a variable's p-value crosses the .05 threshold, the variable is insignificant and you can remove it from your model. Note that sklearn's LinearRegression does not report p-values; statsmodels does, and sklearn also provides automated selection utilities in sklearn.feature_selection (e.g. RFE, SelectKBest). You can also look at the correlation matrix to see which features have little correlation with the target. In the end, statistics are just numbers; it is up to humans to interpret the results.
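A minimal sketch of the p-value route using statsmodels (assuming the df1 built above, and the .05 cut-off mentioned; not the only way to do this):

import statsmodels.api as sm

X = df1[['X1', 'X2', 'X3', 'X4', 'X5']]
y = df1['Target']

# Fit OLS with an intercept; the summary's P>|t| column holds the p-values
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.summary())

# Keep only features whose p-value is below the .05 cut-off
pvals = ols.pvalues.drop('const')
selected = pvals[pvals < 0.05].index.tolist()
print("Selected features:", selected)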

Related

Row iteration in polars: fast calculation of exponentially weighted sum for irregularly sampled time series

This can't be vectorised and we need to loop through the rows. I'm wondering if this can be done effectively in polars without casting. I see in the polars documentation for polars.DataFrame.rows it says:
Row-iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.
In pandas/numpy, the fastest way I can imagine is to use numba, roughly like this:
import numba as nb
import numpy as np

@nb.jit(nopython=True)
def exponential_sum(signal, decay, initial_value=0):
    # Running exponentially weighted sum with a per-step decay factor
    n = len(signal)
    exp_sum = np.zeros(n)
    exp_sum[0] = signal[0]
    for i in range(1, n):
        exp_sum[i] = exp_sum[i - 1] * decay[i] + signal[i]
    return exp_sum
where decay = np.exp(times.diff() * alpha) (alpha controls the decay rate and the half-life).
I have tried casting to numpy and using numba as mentioned, but I'm wondering if there's an in-polars approach that is performant, and if not, whether it is just not implemented or it is a problem that polars is not well suited for due to the storage format.
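For concreteness, a minimal sketch of the casting approach (the column names, sample values, and alpha are made up for illustration; exponential_sum is the function above):

import polars as pl

# Hypothetical frame with irregularly sampled times
df = pl.DataFrame({"time": [0.0, 0.5, 1.2, 2.0], "signal": [1.0, 2.0, 0.5, 3.0]})
alpha = -0.5  # negative so the weights decay with elapsed time

times = df["time"].to_numpy()
signal = df["signal"].to_numpy()
# decay[0] is never used, since exp_sum[0] = signal[0]
decay = np.exp(np.diff(times, prepend=times[0]) * alpha)
result = exponential_sum(signal, decay)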

Applying k-folds (stratified 10-fold) to my text classification model

I need help with the following. I have a data frame with two columns: class (0/1) and text.
After cleansing (lemmatizing, removing stopwords, etc.), I split the data like the following:
# Splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    dataset['text_lemm_nostop'], dataset['class'], test_size=0.3, random_state=50)
Then I used n-gram:
# Using n-grams
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=5, ngram_range=(2, 2), max_features=5000).fit(X_train)
print('No. of features:', len(vect.get_feature_names_out()))  # how many features
Then I did the vectorizing:
X_train_vectorized=vect.transform(X_train)
X_test_vectorized=vect.transform(X_test)
Then I started applying ML algorithms (logistic regression, Naive Bayes, RF, etc.); I will share only the logistic regression:
# Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
predictions = model.predict(X_test_vectorized)
print("AUC Score is:", roc_auc_score(y_test, predictions), '\n')
print("Accuracy:", accuracy_score(y_test, predictions), '\n')
print('Classification Report:\n', classification_report(y_test, predictions))
My questions:
1. Is what I am doing fine if I go with a normal split (30% test)? (I feel there may be an issue with the n-gram code!)
2. If I want to use k-fold cross-validation (i.e., stratified 10-fold), how could I do that in my code?
I appreciate any help and support!
A few comments:
Generally I don't see any obvious mistake; it seems fine to me.
Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2), for instance. It's probably worth trying unigrams only as well, at least to compare the performance.
To perform 10-fold cross-validation, I'd suggest KFold because it's the simplest method, or StratifiedKFold for the stratified variant you mention. It's almost the same code, except that you insert it inside a loop that iterates over the 10 folds. Each time, you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set; both sets are obtained by selecting the corresponding indexes returned by KFold, as in the sketch below.
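A minimal sketch of that loop, assuming the dataset frame from your question (StratifiedKFold and accuracy are my choices here; swap in your preferred metric). Note the vectorizer is refit on each training fold to avoid leakage:

from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = dataset['text_lemm_nostop']
y = dataset['class']

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=50)
scores = []
for train_idx, test_idx in skf.split(X, y):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y.iloc[train_idx], y.iloc[test_idx]
    # Fit the vectorizer on the training fold only to avoid leakage
    vect = CountVectorizer(min_df=5, ngram_range=(1, 2), max_features=5000).fit(X_tr)
    model = LogisticRegression().fit(vect.transform(X_tr), y_tr)
    scores.append(accuracy_score(y_te, model.predict(vect.transform(X_te))))
print("Mean accuracy over 10 folds:", sum(scores) / len(scores))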

sklearn minmaxscaler ported to a different notebook

How would I go about exporting the min_max_scaler attributes so that I could apply the same transform to data in a different notebook?
For full disclosure, I've trained a NN in one notebook and am running it in a different location. It is simple for me to load the trained weights of the NN in the second location, but I need to scale the data before inputting it into the model. To be accurate, I believe it has to use the original scaling attributes.
Per the documentation, you can recreate what MinMaxScaler does using:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where X is your original dataset. (As long as your feature range is the default (0, 1), the second line above is not needed; you will come out with X_scaled = X_std.)
If you want to do this same computation using your already-trained MinMaxScaler instead of your original dataset, consider the following example (again assuming the feature range is left at the default (0, 1)):
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
# Test data set
X = pd.DataFrame(np.random.randint(0, 100, size=(20,4)))
# Test scaler
scaler = MinMaxScaler()
sklearn_result = scaler.fit_transform(X)
# Compute manually, and verify the results match up to machine precision
manual_result = (X - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
(sklearn_result - manual_result).max().max()  # is around 1e-16
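To carry this across notebooks, one option is to persist the fitted attributes and reload them on the other side. A minimal sketch (the file names and X_new are made up for illustration):

# In the training notebook: persist the fitted attributes
np.save("scaler_data_min.npy", scaler.data_min_)
np.save("scaler_data_max.npy", scaler.data_max_)

# In the second notebook: reload and apply the identical transform
data_min = np.load("scaler_data_min.npy")
data_max = np.load("scaler_data_max.npy")
X_new_scaled = (X_new - data_min) / (data_max - data_min)  # X_new: your new data

Alternatively, the whole fitted scaler object can be serialized, e.g. with joblib.dump, and loaded in the second notebook.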

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models like e.g. VGG16.
To get started, I ran the example from the [Keras documentation site](https://keras.io/applications/) for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The preprocess_input() function bothers me (the function does zero-centering by the mean pixel, as can be seen in the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
a) If yes, can one conclude that you always have to know which preprocessing steps were performed during the training phase?
b) If no: does preprocessing of validation/test data cause a bias?
I appreciate your help.
Yes, you should use the preprocessing step. You can retrain the model without it, but then the first layers will have to learn to center your data, which is a waste of parameters. If you do not recenter, performance will suffer.
Great thread on Reddit: https://www.reddit.com/r/MachineLearning/comments/3q7pjc/why_is_removing_the_mean_pixel_value_from_each/
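As a concrete illustration (a sketch; the random array merely stands in for real test images), the same preprocess_input from the question's snippet is applied to any validation/test batch before predicting:

# Reusing model and preprocess_input as imported in the question
X_test = np.random.rand(8, 224, 224, 3) * 255  # stand-in for 8 test images
X_test = preprocess_input(X_test)              # identical preprocessing as at training time
test_features = model.predict(X_test)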

Getting values in Seaborn boxplot

I would like to get the specific values from a boxplot generated in Seaborn (i.e., median, quartiles). For example, in the boxplot below (source: link), is there any way to get the median and quartiles instead of estimating them manually?
import numpy as np
import seaborn as sns

sns.set(style="ticks", palette="muted", color_codes=True)

# Load the example planets dataset
planets = sns.load_dataset("planets")

# Plot the orbital period with horizontal boxes
ax = sns.boxplot(x="distance", y="method", data=planets,
                 whis=np.inf, color="c")
I would encourage you to become familiar with using pandas to extract quantitative information from a dataframe. For instance, a simple thing you could do to get the values you are looking for (and other useful ones) would be:
planets.groupby("method").distance.describe().unstack()
which prints a table of useful values for each method.
Or if you just want the median:
planets.groupby("method").distance.median()
Sometimes I use my data as a list of arrays instead of pandas. In that case, for an array d, you might need:
min(d), np.quantile(d, 0.25), np.median(d), np.quantile(d, 0.75), max(d)