Density-Based Clustering Validation (DBCV) never stops running

I have finished running DBSCAN on a dataset of mine, clustering patches of deforestation, and I am attempting to validate the results according to this paper.
I have installed the package from this GitHub, but when I try to run the code it never completes. I ran it for over 5 days and it never stopped running or threw an error. Running DBSCAN only took 15 minutes, so I am a little confused why just the validation step is taking so long. Is there something I'm getting wrong with the DBCV code or the inputs?
Since the code never finishes running, there is no error I can report. I am unsure if I'm passing the data into the code correctly, but I tried to copy the example on GitHub as closely as I could. I don't know how to share my .csv file to show what it looks like. It has 16 dimensions that I scale with a MinMaxScaler before running DBSCAN. I have already completed the DBSCAN clustering and am just trying to get the DBCV score to work.
import pandas as pd
import numpy as np
from pylab import rcParams
import matplotlib.pyplot as plot
import sklearn
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler
from scipy.spatial import euclidean
from DBCV import DBCV
f = pd.read_csv("csv_file_I_Don't_know_how_to_share")
x = f.iloc[:, 1:16].values  # columns 1-15 hold the feature values (adjust to your layout)
norm_data = MinMaxScaler()
data = norm_data.fit_transform(x)  # scale each feature to [0, 1]
dbscan = DBSCAN(eps=.15, min_samples=100)
clusters = dbscan.fit_predict(data)  # cluster label per point, -1 = noise
DBCV_score = DBCV(data, clusters, dist_function=euclidean)
print ('DBCV Score: ' + DBCV_score)
I'm expecting a score to be printed but instead the code continues to run and doesn't stop. Any help would be great!

You run:
from scipy.spatial import euclidean
But the code on GitHub expects euclidean to be imported like this:
from scipy.spatial.distance import euclidean
Try changing this, it might work.

In addition to the answer by @Dumbfool, there seems to be an error in:
print('DBCV Score: ' + DBCV_score)
DBCV_score is a float, so concatenating it to a string raises a TypeError. Try changing the + to a , (or convert the score with str()).
I hope this helps.
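Putting both suggestions together, a minimal corrected sketch (reusing the data and clusters arrays from the question's code) would look like this:
from scipy.spatial.distance import euclidean  # correct import path
from DBCV import DBCV

DBCV_score = DBCV(data, clusters, dist_function=euclidean)
print('DBCV Score:', DBCV_score)  # comma instead of + avoids the str/float TypeError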

Related

Graph missing from output

The following code works when I run it via Jupyter through Anaconda, yet when I do the same through VSC it gives me no error messages but the output is missing the graph and only shows the top "name" segment (see picture). I would really like it to work in VSC; any ideas why it isn't working? I have tried everything: reinstalling, updating, switching interpreters/kernels, etc., and nothing works.
import glob
import numpy as np
import matplotlib.pyplot as plt
import lumicks.pylake as lk
%matplotlib notebook
#lk.pytest()
fdcurves = {}
# Control curves
for filename in glob.glob('C:/Users/paulv/OneDrive/Desktop/20220321_BNT_groupA/20220321-140249 FD Curve B7 1nM DNA Fwd1.h5'):
    file = lk.File(filename)
    key, curve = file.fdcurves.popitem()
    fdcurves[key] = curve
fdcurves
selector = lk.FdDistanceRangeSelector(fdcurves)
plt.show()
Thank you in advance!

AffinityPropagation with Sentence Transformer not converging

I'm trying to cluster a text dataset using Sentence Transformers and Affinity Propagation, but I keep getting "ConvergenceWarning: Affinity propagation did not converge, this model will not have any cluster centers." and, therefore, no clustering. I've been trying to figure it out for some time but couldn't make it work, and there is little documentation online for text clustering with this algorithm.
from sklearn.cluster import AffinityPropagation
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
sentence_embeddings = model.encode(texts)  # texts: the list of strings to cluster
affprop = AffinityPropagation()
af = affprop.fit(sentence_embeddings)
cluster_centers_indices = af.cluster_centers_indices_
len(cluster_centers_indices) # this line returns zero
Did someone get this same problem and have some workaround to suggest?
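One workaround worth trying (my suggestion, not something confirmed in this thread) is to loosen AffinityPropagation's convergence settings; damping, max_iter and convergence_iter are the standard scikit-learn knobs for exactly this warning:
from sklearn.cluster import AffinityPropagation

# Heavier damping and more iterations often let the message passing settle.
affprop = AffinityPropagation(damping=0.9, max_iter=1000, convergence_iter=50)
af = affprop.fit(sentence_embeddings)
print(len(af.cluster_centers_indices_))  # non-zero once the fit converges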

Applying k-folds (stratified 10-fold) to my text classification model

I need help with the following: I have a data frame with two columns, class (0/1) and text.
After cleansing (lemmatizing, removing stopwords, etc.), I split the data as follows:
#splitting dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset['text_lemm_nostop'], dataset['class'],test_size=0.3, random_state=50)
Then I used n-gram:
#using n-gram
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(min_df=5,ngram_range=(2,2), max_features=5000).fit(X_train)
print('No. of features:', len(vect.get_feature_names_out()))  # how many features
Then I did the vectorizing:
X_train_vectorized=vect.transform(X_train)
X_test_vectorized=vect.transform(X_test)
Then I started applying ML algorithms like (logistic regression, Naive Bayes, RF,..etc)
and I will share only the logistic regression
#Logistic regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train_vectorized, y_train)
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics
predictions = model.predict(X_test_vectorized)  # reuse the already-vectorized test set
print("AUC Score is: ", roc_auc_score(y_test, predictions),'\n')
print("Accuracy:",metrics.accuracy_score(y_test, predictions),'\n')
print('Classification Report:\n',classification_report(y_test,predictions))
My Questions:
1. Is what I am doing fine if I go with a normal split (30% test)?
(I feel there is an issue with the n-gram code!)
2. If I want to use k-fold cross-validation (i.e. stratified 10-fold), how could I do that in my code?
Appreciate any help and support!!
A few comments:
Generally I don't see any obvious mistake; this seems fine to me.
Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2) for instance. It's probably worth trying unigrams only, at least to compare the performance.
To perform 10-fold cross-validation, I'd suggest you use KFold because it's the simplest method (or StratifiedKFold, since you want the folds stratified by class). In this case it's almost the same code, except that you insert it inside a loop which iterates over the 10 folds. Each time, you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set; both sets are obtained by selecting the corresponding indexes returned by KFold. See the sketch below.
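For question 2, here is a minimal sketch of stratified 10-fold cross-validation wired into the pipeline above. It assumes the same dataset DataFrame as in the question, and re-fits the vectorizer inside each fold so no test-fold information leaks into the features:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X = dataset['text_lemm_nostop']
y = dataset['class']
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=50)
auc_scores = []
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    # Fit the vectorizer on the training fold only, to avoid leakage.
    vect = CountVectorizer(min_df=5, ngram_range=(1, 2), max_features=5000).fit(X_train)
    model = LogisticRegression(max_iter=1000).fit(vect.transform(X_train), y_train)
    auc_scores.append(roc_auc_score(y_test, model.predict(vect.transform(X_test))))
print('Mean AUC over 10 folds:', np.mean(auc_scores))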

Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
df1.rename(columns={0:"X1",1:"X2",2:"X3",3:"X4",4:"X5",5:"Target"},inplace=True)
sns.heatmap(df1.corr(),annot=True);
Correlation Matrix
Now I can ask my question: how can I choose which features to include in the model?
I am not that well-versed in Python, as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables,Target)
# Or simply do
model = LinearRegression().fit(Variables,Target)
# So based on the dataset head provided, it should be
X = df1[['X1','X2','X3','X4','X5']]
Y = df1['Target']
model = LinearRegression().fit(X,Y)
In order to do feature selection, you need to fit the model first and then check the p-values. Typically, a p-value of 5% (.05) or less is a good cut-off point: if a variable's p-value crosses the .05 threshold, the variable is insignificant and you can remove it from your model. You have to do this manually; note that scikit-learn's LinearRegression does not report p-values, so for this approach you would use a library such as statsmodels (see the sketch below). You can also look at the correlation matrix to see which variables have little correlation with the target. Scikit-learn does ship automated selectors (e.g. RFE in sklearn.feature_selection) if you prefer not to do it by hand. In the end, statistics are just numbers; it is up to humans to interpret the results.
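A sketch of the p-value approach described above, using statsmodels (my choice of library; it is not used in the original answer). It assumes the df1 built in the question:
import statsmodels.api as sm

X = df1[['X1', 'X2', 'X3', 'X4', 'X5']]
Y = df1['Target']
ols = sm.OLS(Y, sm.add_constant(X)).fit()  # add_constant supplies the intercept term
print(ols.summary())   # the P>|t| column lists the per-feature p-values
print(ols.pvalues)     # the same values as a pandas Series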

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models, e.g. VGG16.
To get started, I ran the example from the [Keras documentation site](https://keras.io/applications/) for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The preprocess_input() function bothers me
(the function does zero-centering by mean pixel, as can be seen by looking at the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
a)
If yes, does it follow that one always has to know which preprocessing steps were performed during the training phase?
b)
If no: does preprocessing the validation/test data introduce a bias?
I appreciate your help.
Yes, you should use the preprocessing step. You could retrain the model without it, but then the first layers would learn to center your data themselves, which is a waste of parameters.
If you do not recenter, your performance will suffer.
Great thread on reddit : https://www.reddit.com/r/MachineLearning/comments/3q7pjc/why_is_removing_the_mean_pixel_value_from_each/
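To make that concrete, here is a minimal sketch (same API as the question's snippet; the file name is hypothetical) showing that a held-out test image goes through the identical preprocess_input call before prediction:
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
import numpy as np

model = VGG16(weights='imagenet', include_top=False)
# Any test/validation image must see the same preprocessing used in training.
img = image.load_img('some_test_image.jpg', target_size=(224, 224))  # hypothetical file
x = np.expand_dims(image.img_to_array(img), axis=0)
x = preprocess_input(x)  # same mean-pixel centering the network was trained with
features = model.predict(x)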