Applying k-folds (stratified 10-fold) to my text classification model - classification

I need help with the following. I have a data frame with two columns: Class (0, 1) and text.
After cleaning (lemmatizing, removing stopwords, etc.), I split the data as follows:
#splitting dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset['text_lemm_nostop'], dataset['class'],test_size=0.3, random_state=50)
Then I used n-gram:
#using n-gram
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(min_df=5,ngram_range=(2,2), max_features=5000).fit(X_train)
print('No. of features:')
len(vect.get_feature_names_out()) # how many features
Then I did the vectorizing:
Then I started applying ML algorithms like (logistic regression, Naive Bayes, RF,..etc)
and I will share only the logistic regression
#Logistic regression
from sklearn.linear_model import LogisticRegression
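# fit on the vectorized training text, then predict on the vectorized test text
model=LogisticRegression().fit(vect.transform(X_train), y_train)
predictions=model.predict(vect.transform(X_test))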
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics
print("AUC Score is: ", roc_auc_score(y_test, predictions),'\n')
print("Accuracy:",metrics.accuracy_score(y_test, predictions),'\n')
print('Classification Report:\n',classification_report(y_test,predictions))
My Questions:
1. Is what I am doing fine if I stick with a normal split (30% test)?
(I feel I have issues with the n-gram code!)
2. If I want to use K-fold cross-validation (i.e. stratified 10-fold), how could I do that in my code?
I appreciate any help and support!

A few comments:
Generally I don't see any obvious mistake, seems fine to me.
Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2) for instance. It's probably worth trying unigrams only, at least to compare the performance.
To perform 10-fold cross-validation, I'd suggest you use KFold because it's the simplest method. In this case it's almost the same code, except that you put it inside a loop which iterates over the 10 folds. Each time, you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set. Both the training and test sets are obtained by selecting the corresponding indices returned by KFold.
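The loop described above can be sketched as follows. This is a minimal sketch, not the poster's actual pipeline: the toy `texts` and `labels` arrays are made-up stand-ins for `dataset['text_lemm_nostop']` and `dataset['class']`, and `StratifiedKFold` is used instead of plain `KFold` because the question asks for stratified folds (the two classes share the same `split` interface). The vectorizer is fitted inside the loop, on the training fold only, so that no vocabulary leaks from the test fold.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Toy stand-in data (hypothetical): 20 positive and 20 negative documents
texts = np.array(["good movie great plot"] * 20 + ["bad movie poor plot"] * 20)
labels = np.array([1] * 20 + [0] * 20)

# Stratified folds keep the class ratio of the full data set in every fold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=50)

fold_scores = []
for train_idx, test_idx in skf.split(texts, labels):
    X_train, X_test = texts[train_idx], texts[test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]

    # Fit the vectorizer on the training fold only (unigrams and bigrams)
    vect = CountVectorizer(ngram_range=(1, 2)).fit(X_train)
    model = LogisticRegression().fit(vect.transform(X_train), y_train)

    # Evaluate on the held-out fold
    predictions = model.predict(vect.transform(X_test))
    fold_scores.append(accuracy_score(y_test, predictions))

print("Accuracy per fold:", fold_scores)
print("Mean accuracy:", np.mean(fold_scores))
```

With pandas Series instead of NumPy arrays, the fold selection would be `dataset['text_lemm_nostop'].iloc[train_idx]` and so on. The `min_df` and `max_features` settings from the question are left out here only because the toy data is tiny.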


Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
Correlation Matrix
Now I can ask my question. How can I choose features that will be included in the model?
I am not that well-versed in python as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables,Target)
# Or simply do
model = LinearRegression().fit(Variables,Target)
# So based on the dataset head provided, it should be
model = LinearRegression().fit(X,Y)
To do feature selection, fit the model first and then check the p-value of each coefficient. Note that scikit-learn's LinearRegression does not report p-values, so you would need something like statsmodels for that step. Typically, a p-value of 5% (.05) or less is a good cut-off point: if a variable's p-value is above .05, it is not statistically significant and you can consider removing it from your model. You can also look at the correlation matrix to see which variables have little correlation with the target. You can do all of this manually, although scikit-learn also ships automated helpers in sklearn.feature_selection (for example SelectKBest and RFE). In the end, statistics are just numbers. It is up to humans to interpret the results.
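As a rough illustration of the p-value check, here is a sketch that computes the ordinary least squares coefficients and their two-sided t-test p-values by hand with NumPy and SciPy. The data comes from `make_regression` as in the question; `n_informative=2`, `random_state=0` and the names `p_values` and `keep` are choices made for this example only. A statistics package such as statsmodels would report the same quantities in its regression summary.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression

# Synthetic data similar to the question; only 2 of the 5 features carry signal (assumption)
X, y = make_regression(n_samples=300, n_features=5, n_informative=2,
                       noise=5, random_state=0)
n, p = X.shape

# Add an intercept column and solve the least squares problem
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Residual variance and standard errors of the coefficients
residuals = y - Xd @ beta
dof = n - (p + 1)
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))

# Two-sided p-values for H0: coefficient == 0
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

# Keep the features whose p-value is at or below the 0.05 cut-off (index 0 is the intercept)
keep = [j for j in range(p) if p_values[j + 1] <= 0.05]
print("p-values (intercept first):", np.round(p_values, 4))
print("features kept:", keep)
```

The 0.05 threshold is a convention rather than a rule, and removing several variables at once based on a single fit can be misleading when the features are correlated with each other.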

Is there any plan to implement complex survey design within Spark's MLLIB package?

I'm working to implement a logistic regression in Pyspark that is currently written in SAS using proc surveylogistic. The SAS implementation is able to account for complex survey design involving clusters/strata/sample weights.
There are some avenues out there for at least getting the model into Python: for example, I was able to get a close match of both coefficients and standard errors using the statsmodels package from this research project on GitHub.
However, my data is big and so I'd like to take advantage of Spark's distributed capabilities through the MLLIB package. For example, the current setup to run the logit in Spark is:
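import pyspark.ml.feature as ft
from pyspark.ml.regression import GeneralizedLinearRegression
featuresCreator = ft.VectorAssembler(
    inputCols = X_features_list,
    outputCol = "features")
# the rest of this argument list was cut off in the post; weightCol is the parameter
# discussed below, and "sample_weight" is only a placeholder column name
glm_binomial = GeneralizedLinearRegression(family="binomial", link="logit", maxIter=25, regParam = 0,
    weightCol="sample_weight")
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[featuresCreator, glm_binomial])
model = pipeline.fit(train_df)  # train_df: placeholder name for the training DataFrame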
The weightCol parameter works only for simple sample weights, but I'm wondering if anyone is aware of a method for implementing a more complex weighting scheme in Spark (note that the above would match a proc logistic, not a proc surveylogistic). For comparison, the method used to calculate the covariance matrix in the surveylogistic is here.

How can I edit code to work on my data set instead of movie_reviews corpora for NB classifier?

I am trying to train the Naive Bayes Classifier with my training data sets which have been classified into positive and negative tweets manually.
I have found plenty of code that trains using the movie_reviews corpus or similar type dataset, but not one in which there are only 2 files, one negative, one positive.
Example code I found:
import string
from nltk.corpus import LazyCorpusLoader, CategorizedPlaintextCorpusReader
from nltk.corpus import stopwords
my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt',
                                    cat_pattern=r'(neg|pos)/.*', encoding='ascii')
mr = my_movie_reviews
stop = stopwords.words('english')
# one (filtered word list, category) pair per file; the category is the folder name
documents = [([w for w in mr.words(i) if w.lower() not in stop and
               w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
for i in documents:
    print(i)
My problem is with the one-line list comprehension. I don't have to deal with file IDs in my program, since I have only one file in each category. How can I edit that statement?
My corpus: - category 1 - category 2

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models like e.g. VGG16.
In order to get started, I ran the example from the Keras documentation site for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The preprocess_input() function used here bothers me
(it zero-centers the input by the mean pixel, as can be seen in the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
If yes, one can conclude that you always have to be aware of what preprocessing steps have been performed during training phase?!
If no: Does preprocessing of validation/test data cause a bias?
I appreciate your help.
Yes, you should use the preprocessing step. You could retrain the model without it, but then the first layers would have to learn to center your data, which is a waste of parameters.
If you do not re-center the inputs, performance will suffer.
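To make the point concrete, here is a NumPy sketch of zero-centering by mean pixel. It is not the Keras implementation, only an illustration of what VGG16-style preprocessing does to my knowledge: it reorders the channels from RGB to BGR and subtracts the per-channel ImageNet means. The helper name `zero_center` and the random image batches are made up for this example. The important part is that the same function, with the same constants computed on the training data, is applied to the training and the test inputs.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (constants from the training data, not from the test set)
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def zero_center(batch_rgb):
    """Convert an RGB batch of shape (N, H, W, 3) to BGR and subtract the channel means."""
    batch_bgr = batch_rgb[..., ::-1].astype(np.float32)
    return batch_bgr - IMAGENET_BGR_MEAN

# Random stand-ins for image batches (hypothetical data)
rng = np.random.default_rng(0)
train_batch = rng.integers(0, 256, size=(4, 224, 224, 3)).astype(np.float32)
test_batch = rng.integers(0, 256, size=(2, 224, 224, 3)).astype(np.float32)

# The identical transformation is applied to both splits
train_ready = zero_center(train_batch)
test_ready = zero_center(test_batch)
print(train_ready.shape, test_ready.shape)
```

Because the means are fixed constants taken from the training set, applying them to validation or test images does not leak information from those images, so it introduces no bias.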
Great thread on reddit :

Reporting log-likelihood / perplexity of spark LDA model (different in local vs distributed models?)

Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10;
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters as well as the mixing weights for each document are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood (and this is what makes the variational approximation/lower bound necessary, though even without the approximation the log-likelihoods would not be the same since the models are just different.)
As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e. the DistributedLDAModel). As a side note, the primary (only?) reason that I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more info on this see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0), the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
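The relationship in 2) can be checked with the numbers reported in the question. The sketch below is plain Python arithmetic, not a Spark call. The token count is not given in the post, so it is inferred here from the two LocalLDAModel values, which is an assumption. The last two lines show the plain perplexity as the exponential of the log-perplexity, and the conversion to bits per token.

```python
import math

# Values reported in the question for the LocalLDAModel
log_likelihood = -3672913.268234148
log_perplexity = 0.36729132682898674

# log_perplexity = -log_likelihood / token_count, so the token count can be inferred (assumption)
token_count = -log_likelihood / log_perplexity

# Plain perplexity is the exponential of the log-perplexity
perplexity = math.exp(log_perplexity)

# Dividing by ln(2) converts nats per token to bits per token
bits_per_token = log_perplexity / math.log(2.0)

print("inferred token count:", round(token_count))
print("perplexity:", perplexity)
print("bits per token:", bits_per_token)
```

Reporting the value on the log scale is convenient because it is simply a per-token average of the log-likelihood bound; exponentiating it gives the perplexity figure usually quoted in papers.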