How can I edit code to work on my data set instead of movie_reviews corpora for NB classifier? - classification

I am trying to train the Naive Bayes Classifier with my training data sets which have been classified into positive and negative tweets manually.
I have found plenty of code that trains using the movie_reviews corpus or similar type dataset, but not one in which there are only 2 files, one negative, one positive.
Example code I found:
import string
from nltk.corpus import LazyCorpusLoader,
CategorizedPlaintextCorpusReader
from nltk.corpus import stopwords
my_movie_reviews = LazyCorpusLoader('my_movie_reviews', CategorizedPlaintextCorpusReader, r'(?!\.).*\.txt',
cat_pattern=r'(neg|pos)/.*', encoding='ascii')
mr = my_movie_reviews
stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and
w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
for i in documents:
print i
My problem is in the one-liner loop statement. I dont have to deal with fileid in my program, since I have only one file in each category. How can I edit that statement?
My corpus:
nltk.data/corpora/my_corpus/negative/negative_tweets.txt - category 1
nltk.data/corpora/my_corpus/positive/positive_tweets.txt - category 2

Related

How to model tf-idf spark

I'm trying to re-write a code wrote (that it's in Python), but now in spark.
#pandas
tfidf = TfidfVectorizer()
df_final = np.array(tfidf.fit_transform(df['sentence']).todense())
I read on spark documentation, is it necessary to use Tokenizer, HashingTF and then IDF to model tf-idf in PySpark?
#pyspark
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
tokenizer = Tokenizer(inputCol = "sentence", outputCol = "words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol = "words", outputCol="rawFeatures", numFeatures = 20)
tf = hashingTF.transform(wordsData)
idf = IDF(inputCol = "rawFeatures", outputCol = "features")
tf_idf = idf.fit(tf)
df_final = tf_idf.transform(tf)
I'm not sure if you understand clearly how tf-idf model works, since tokenizing is essential and fundamental for tf-idf model no matter in sklearn or spark.ml version. You post actually cover 2 questions:
Why tf-idf need to tokenization the sentence: I won't copy the mathematical equation since it's easy to search in google. Long in short, tf-idf is a statistical measurement to evaluate the relevancy and relationship between a word to a document in a collection of documents, which is calculated by the how frequent a word appear in a document (tf) and the inverse frequency of the word across a set of documents (idf). Therefore, as the essence is the vocabulary and all calculation are based on vocaulary, if your input is sentence like what you mentioned in your sklearn version, you must do the tokenizing of the sentence before the calculation, otherwise the whole methodology is not valid anymore.
How tf-idf work in sklearn: If you understand how tf-idf works, then you should understand the different steps in the example of spark official document are essential. Thanks for the sklearn developer to create such convenient API, you can use the .fit_transform() directly with the Series of sentence. In fact, if you check the source code of the TfidfVectorizer in sklearn, you can see that it actually did the "tokenization", just in a different way:
It inherits the from the CountVectorizer (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1717)
It uses the ._count_vocab() method in CountVectorizer to transform your sentence. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1338)
In ._count_vocab(), it checks each sentences and create the sparse matrix to store the frequency of each vocabulary in each sentences before the tf-idf calculation. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1192)
To conclude, tokenizing the sentence is essential for the tf-idf model calculation, the example in spark official documents is efficient enough for your model building. Remember to use the function or method if spark provide such API and DON'T try to build the user defined function/class to achieve the same goal, otherwise it may reduce your computing performance or trigger other issue like out-of-memory.

Applying k-folds (stratified 10-fold) to my text classification model

I need help in the following, I have a data frame with Columns: Class (0,1) and text.
After cleansing (lemmatizing, removing stopwords, etc), I split the data like the following:
#splitting datset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset['text_lemm_nostop'], dataset['class'],test_size=0.3, random_state=50)
Then I used n-gram:
#using n-gram
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(min_df=5,ngram_range=(2,2), max_features=5000).fit(X_train)
print('No. of features:')
len(vect.get_feature_names_out()) # how many features
Then I did the vectorizing:
X_train_vectorized=vect.transform(X_train)
X_test_vectorized=vect.transform(X_test)
Then I started applying ML algorithms like (logistic regression, Naive Bayes, RF,..etc)
and I will share only the logistic regression
#Logistic regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train_vectorized, y_train)
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics
predictions=model.predict(vect.transform(X_test))
print("AUC Score is: ", roc_auc_score(y_test, predictions),'\n')
print("Accuracy:",metrics.accuracy_score(y_test, predictions),'\n')
print('Classification Report:\n',classification_report(y_test,predictions))
My Questions:
1. Is what I am doing is fine in case I am going with normal splitting (30% test)?
(I feel having issues with n-gram code!)
2. If I want to engage the K-fold cross-validation (ie. Stratified 10-fold), how could I do that in my code?
Appreciate any help and support!!
A few comments:
Generally I don't see any obvious mistake, seems fine to me.
Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2) for instance. It's probably worth trying unigrams only, at least to compare the performance.
To perform 10-fold cross-validation, I'd suggest you use KFold because it's the simplest method. In this case it's almost the same code, except that you insert it inside a loop which iterates the 10 folds. Every time you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set. Both training and test set are obtained by selecting the corresponding indexes indexes returned by KFold.

Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
df1.rename(columns={0:"X1",1:"X2",2:"X3",3:"X4",4:"X5",5:"Target"},inplace=True)
sns.heatmap(df1.corr(),annot=True);
Correlation Matrix
Now I can ask my question. How can I choose features that will be included in the model?
I am not that well-versed in python as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables,Target)
# Or simply do
model = LinearRegression().fit(Variables,Target)
# So based on the dataset head provided, it should be
X<-df1[['X1','X2','X3','X4','X5']]
Y<-df1['Target']
model = LinearRegression().fit(X,Y)
In order to do feature selections. You need to run the model first. Then check for the p-value. Typically, a p-value of 5% (.05) or less is a good cut-off point. If the p-value crosses the upper threshold of .05, the variable is insignificant and you can remove it from your model. You will have to do this manually. You can also tell by looking from the correlation matrix to see which value has less correlation to the target. AFAIK, there are no libs with built-in functionality to do feature selection automatically. In the end, statistics are just numbers. It is up to humans to interpret the results.

Orange3 concatenate tables with different targets

I have two input datafiles to use in orange, one corresponds to the train set (with targets "A", "B" and "C") and the other to the unknown samples ( with targets "D" and "E" to be able to identify the unknown samples in the scatterplot of the two first principal components).
I have applied PCA to the train dataset and through a python script i have reapplied the PCA transformation to the test dataset, however the result have a ? in the target value for all entries in the unknown samples set.
I have tried to merge the train and unknown samples sets with the merge table widget, and apparently it does the same, all samples in train are correct, but the unknown samples have ? as targets.
The only way i managed to have this running properly is to have unknown samples and train set on the same input file. Which is not practical for obvious reasons.
Is there any way to fix this?
Please note that i have tried to change the domain.class_var and the target value directly on the transformed unknown samples, but it also alters the domain of the train dataset. Apparently when the new table is created it just have a reference to the domain of the original train data after PCA.
I have managed it by converting the data into numpy arrays concatenate them and then back to table.
Here is the code if anyone is interested:
import numpy
from Orange.data.table import Table
from Orange.data import Domain, DiscreteVariable, ContinuousVariable
trnsfrmd_knwn_data = numpy.array(in_object)
trnsfrmd_unkwn_data = numpy.array(Table(in_object.domain,in_data))
ndx = list(set(trnsfrmd_knwn_data[:,len(trnsfrmd_knwn_data[0])-1].tolist()))[-1] + 1
trnsfrmd_unkwn_data[:,len(trnsfrmd_knwn_data[0])-1] = numpy.array([i for i in range(0, len(trnsfrmd_unkwn_data))]) + ndx
targets = in_object.domain.class_var.values + in_data.domain.class_var.values
dm = Domain([ContinuousVariable(x.name) for x in in_object.domain.attributes], DiscreteVariable('region', values=targets))
out_data = Table.from_numpy(dm, numpy.append(trnsfrmd_knwn_data,trnsfrmd_unkwn_data,axis=0))

Text Categorization datasets for MATLAB

I am looking for a reliable dataset for Text categorization tasks in MATLAB format.
I want to run some experiments and don't want to spend too much time in preprocessing the text and creating feature vectors. I need something to be ready so I can plug it in my algorithm. I found a MATLAB files for reuters dataset here: link text
Everything is ready in here, but I want to use a subset of this. In this "fea" contains the feature vectors for each document. However, it seems that it is not a normal matrix. I want for example to select the top 1000 documents in this "fea". If you just download it and load it into MATLAB you will see what I mean.
So, If it is possible I need a solution for the above-mentioned dataset or any alternative datasets.
Thanks in advance.
It is stored as sparse matrix. Extract the first 1000 documents (rows), and if you have enough space, you can convert it to full dense matrix:
load Reuters21578.mat
TF = full( fea(1:1000,:) );
Lets check the variables we have:
>> whos
Name Size Bytes Class Attributes
TF 1000x18933 151464000 double
fea 8293x18933 4749196 double sparse
gnd 8293x1 66344 double
testIdx 2347x1 18776 double
trainIdx 5946x1 47568 double
so you can see TF is now about 150MB.
Other than that, the rest is self-explanatory:
fea: term-frequency matrix, rows are documents, columns are terms
gnd: category of each document, where numel(unique(gnd)) == 65
trainIdx/testIdx: split of instances (documents) for classification purposes, contains indices of rows, used as: tr = fea(trainIdx,:); tt = fea(testIdx,:);