Row iteration in polars: fast calculation of exponentially weighted sum for irregularly sampled time series - python-polars

This can't be vectorised and we need to loop through the rows. I'm wondering if this can be done effectively in polars without casting. I see in the polars documentation for polars.DataFrame.rows it says:
Row-iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.
In pandas/numpy, the fastest way I can imagine is to use numba, roughly like this:
import numba as nb
import numpy as np

@nb.jit(nopython=True)
def exponential_sum(signal, decay, initial_value=0):
    n = len(signal)
    exp_sum = np.zeros(n)
    exp_sum[0] = signal[0]
    for i in range(1, n):
        exp_sum[i] = exp_sum[i - 1] * decay[i] + signal[i]
    return exp_sum
where decay = np.exp(times.diff() * alpha) (alpha controls the decay rate and the half-life).
I have tried casting to numpy and using numba as mentioned, but I'm wondering if there is an in-polars approach that is performant, and if not, whether that is simply because it hasn't been implemented or because this is a problem polars is not well suited to due to its columnar storage format.
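For reference, this is a minimal sketch of the export-and-reattach pattern I have in mind (the column names "time" and "signal", the example values, and the constant alpha are only illustrative): export the two columns to numpy, run the numba kernel exponential_sum from above, and attach the result back as a new column.
import numpy as np
import polars as pl

# hypothetical frame with irregularly sampled times
df = pl.DataFrame({
    "time": [0.0, 0.1, 0.4, 0.5, 1.0],
    "signal": [1.0, 2.0, 0.5, 1.5, 3.0],
})
alpha = -2.0  # decay rate; negative so the weight shrinks as elapsed time grows

times = df["time"].to_numpy()
signal = df["signal"].to_numpy()

# per-step decay factors; decay[0] is never used by the kernel above
decay = np.exp(np.diff(times, prepend=times[0]) * alpha)

df = df.with_columns(pl.Series("exp_sum", exponential_sum(signal, decay)))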

Related

Applying k-folds (stratified 10-fold) to my text classification model

I need help with the following. I have a data frame with the columns class (0, 1) and text.
After cleansing (lemmatizing, removing stopwords, etc.), I split the data as follows:
#splitting datset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset['text_lemm_nostop'], dataset['class'],test_size=0.3, random_state=50)
Then I used n-gram:
#using n-gram
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(min_df=5,ngram_range=(2,2), max_features=5000).fit(X_train)
print('No. of features:')
len(vect.get_feature_names_out()) # how many features
Then I did the vectorizing:
X_train_vectorized=vect.transform(X_train)
X_test_vectorized=vect.transform(X_test)
Then I started applying ML algorithms like (logistic regression, Naive Bayes, RF,..etc)
and I will share only the logistic regression
#Logistic regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train_vectorized, y_train)
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics
predictions=model.predict(vect.transform(X_test))
print("AUC Score is: ", roc_auc_score(y_test, predictions),'\n')
print("Accuracy:",metrics.accuracy_score(y_test, predictions),'\n')
print('Classification Report:\n',classification_report(y_test,predictions))
My Questions:
1. Is what I am doing fine if I go with a normal split (30% test)?
(I feel there may be issues with the n-gram code!)
2. If I want to use k-fold cross-validation (i.e. stratified 10-fold), how could I do that in my code?
Appreciate any help and support!!
A few comments:
Generally I don't see any obvious mistake; it seems fine to me.
Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2) for instance. It's probably worth trying unigrams only, at least to compare the performance.
To perform 10-fold cross-validation, I'd suggest you use KFold because it's the simplest method. In this case it's almost the same code, except that you insert it inside a loop which iterates over the 10 folds. Each time, you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set. Both the training and test sets are obtained by selecting the corresponding indexes returned by KFold (a rough sketch follows).
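For illustration only (not tested against your data), here is a sketch of how the loop could look; it uses StratifiedKFold since you asked for the stratified variant, and plain KFold is a drop-in replacement. The dataset, column names, and vectorizer settings are taken from your question.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X = dataset['text_lemm_nostop']  # as in the question
y = dataset['class']

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=50)
scores = []

for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # fit the vectorizer on the training fold only, to avoid leakage into the test fold
    vect = CountVectorizer(min_df=5, ngram_range=(1, 2), max_features=5000).fit(X_train)

    model = LogisticRegression()
    model.fit(vect.transform(X_train), y_train)

    predictions = model.predict(vect.transform(X_test))
    scores.append(accuracy_score(y_test, predictions))

print("Mean accuracy over 10 folds:", np.mean(scores))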

Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
df1.rename(columns={0:"X1",1:"X2",2:"X3",3:"X4",4:"X5",5:"Target"},inplace=True)
sns.heatmap(df1.corr(),annot=True);
(correlation matrix heatmap shown here)
Now I can ask my question. How can I choose features that will be included in the model?
I am not that well-versed in python as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables,Target)
# Or simply do
model = LinearRegression().fit(Variables,Target)
# So based on the dataset head provided, it should be
X = df1[['X1','X2','X3','X4','X5']]
Y = df1['Target']
model = LinearRegression().fit(X,Y)
In order to do feature selection, you need to run the model first, then check the p-values. Typically, a p-value of 5% (.05) or less is a good cut-off point. If the p-value crosses the upper threshold of .05, the variable is insignificant and you can remove it from your model. You will have to do this manually. You can also look at the correlation matrix to see which variables have little correlation with the target. AFAIK, there are no libs with built-in functionality to do feature selection automatically. In the end, statistics are just numbers; it is up to humans to interpret the results.
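One wrinkle: sklearn's LinearRegression does not expose p-values. A common way to get them, sketched below under the assumption that df1 is the frame built above, is to fit the same regression with statsmodels and read them off the fitted results.
import statsmodels.api as sm

X = df1[['X1', 'X2', 'X3', 'X4', 'X5']]
Y = df1['Target']

# statsmodels does not add an intercept by default, so add one explicitly
X_const = sm.add_constant(X)

ols_results = sm.OLS(Y, X_const).fit()

# p-value per coefficient; features above the .05 cut-off are candidates for removal
print(ols_results.pvalues)
print(ols_results.summary())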

sklearn minmaxscaler ported to a different notebook

How would I go about downloading the min_max_scaler attributes so that I could apply the same transform to data within a different notebook?
For full disclosure, I've trained a NN within one notebook and am running it in a different location. It is simple for me to load the trained weights of the NN in the second location, but I need to scale the data before feeding it into the model. To be accurate, I believe it has to use the original scaler attributes.
Per the documentation, you can recreate what MinMaxScaler does using:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where X is your original dataset. (Although as long as your feature range is the default of (0,1), the second line above is not needed - you will come out with X_scaled = X_std)
If you want to do this same computation using your already trained MinMaxScaler instead of your original dataset, consider the following example (again assuming the feature range is left at the default (0,1)):
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
# Test data set
X = pd.DataFrame(np.random.randint(0, 100, size=(20,4)))
# Test scaler
scaler = MinMaxScaler()
sklearn_result = scaler.fit_transform(X)
# Compute manually, and verify the results match up to machine precision
manual_result = (X - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
(sklearn_result - manual_result).max().max()  # is on the order of 1e-16
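To carry this over to the second notebook, one option (a sketch; the file name and X_new are hypothetical) is to save the fitted data_min_ and data_max_ attributes to disk and apply the same formula there. Alternatively, the entire fitted scaler can be persisted with joblib.dump and reloaded with joblib.load.
# In the training notebook: persist the fitted attributes
np.savez("minmax_params.npz", data_min=scaler.data_min_, data_max=scaler.data_max_)

# In the second notebook: reload them and apply the identical transform
# X_new: new data arriving in the second notebook (hypothetical)
params = np.load("minmax_params.npz")
X_new_scaled = (X_new - params["data_min"]) / (params["data_max"] - params["data_min"])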

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models such as VGG16.
To get started, I ran the example from the [Keras documentation site](https://keras.io/applications/) for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The preprocess_input() function used here bothers me
(the function does zero-centering by mean pixel, as can be seen by looking at the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
a)
If yes, can one conclude that you always have to be aware of which preprocessing steps were performed during the training phase?
b)
If no: Does preprocessing of validation/test data cause a bias?
I appreciate your help.
Yes, you should use the preprocessing step. You could retrain the model without it, but then the first layers would have to learn to center your data themselves, which is a waste of parameters.
If you do not recenter, your performance will suffer.
Great thread on reddit: https://www.reddit.com/r/MachineLearning/comments/3q7pjc/why_is_removing_the_mean_pixel_value_from_each/
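For completeness, a minimal sketch of applying the very same preprocessing to validation/test images before prediction (the file names are hypothetical), mirroring the snippet from the question:
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image

model = VGG16(weights='imagenet', include_top=False)

# hypothetical list of validation/test image paths
val_paths = ['val_img_1.jpg', 'val_img_2.jpg']

# load and stack the images into a single (n, 224, 224, 3) batch
batch = np.stack([
    image.img_to_array(image.load_img(p, target_size=(224, 224)))
    for p in val_paths
])

# apply exactly the same zero-centering that was used during training
batch = preprocess_input(batch)

features = model.predict(batch)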

Theano -- Mean of squared gradients

In theano, given a batch cost cost with shape (batch_size,), it is easy to compute the gradient of the mean cost, as in T.grad(T.mean(cost,axis=0),p) with p being a parameter used in the computation of cost. This is done efficiently by backpropagating the gradient through the computational graph. What I would now like to do is to compute the mean of the squared gradients over the batch. This can be done using the following piece of code:
import theano
import theano.tensor as T

g_square = T.mean(
    theano.scan(lambda i: T.grad(cost[i], p)**2,
                sequences=T.arange(cost.shape[0]))[0],
    axis=0)
Where for convenience p is assumed to be a single theano tensor and not a list of tensors.
The computation could be performed efficiently by simply backpropagating the gradient until the last step, and squaring the components of the last operation (which should be a sum over the batch index). I might be wrong on this one, but the computation should be as easy, and nearly as fast as a simple backpropagation. However, theano seems unable to optimize the computation, and it keeps using a loop, making computations extremely slow.
Would anyone know of a solution to make the computation efficient, either by forcing optimizations, expressing the computation in a different way, or even going through the backpropagation process?
Thanks in advance.
Your function g_square happens to have complexity O(batch_size**2) instead of O(batch_size) as expected. This makes it appear incredibly slow for larger batch sizes.
The reason is that in every iteration the forward and backward pass are computed over the whole batch, even though only cost[i] for one data point is needed.
I assume the input to the cost computation graph, x, is a tensor with the first dimension of size batch_size. Theano has no means to automatically slice this tensor along this dimension. Therefore computation is always done over the whole batch.
Unfortunately I see no better solution than slicing your input and doing the loop outside Theano:
# x: the symbolic input of the cost graph
# x_batch: the numeric input data batch (a numpy array)
batch_size = x_batch.shape[0]

# per-sample squared gradient, compiled once; fed one-element slices of the batch
g_square_fun = theano.function([x], T.grad(cost[0], p)**2)

g_square_value = 0
for i in range(batch_size):
    g_square_value += g_square_fun(x_batch[i:i+1])
Perhaps when future versions of Theano come with better built-in capabilities to compute Jacobians, there will be more elegant solutions.
After digging deeper into the Theano docs I found a solution that works in the compute graph. The key idea is that you clone the graph of your network inside the scan function, thereby explicitly slicing the input tensor. I tried the following code and empirically it scales as O(batch_size), as expected:
# x: input data batch
# assuming cost = network(x, p)
from theano.gof.graph import clone_get_equiv

def g_square(cost, p):
    g = T.zeros_like(p)

    def scan_fn(i, g, cost, p):
        # clone the graph computing cost, but slice its input
        cloned = clone_get_equiv([], [cost],
                                 copy_inputs_and_orphans=False,
                                 memo={x: x[i:i+1]})
        cost_slice = cloned[cost].reshape([])
        return g + T.grad(cost_slice, p)**2

    result, updates = theano.reduce(scan_fn,
                                    outputs_info=g,
                                    sequences=[T.arange(cost.size)],
                                    non_sequences=[cost.flatten(), p])
    return result