Is there any plan to implement complex survey design within Spark's MLLIB package? - pyspark

I'm working to implement a logistic regression in Pyspark that is currently written in SAS using proc surveylogistic. The SAS implementation is able to account for complex survey design involving clusters/strata/sample weights.
There are some avenues out there for at least getting the model into Python: for example, I was able to get a close match of both coefficients and standard errors using the statsmodels package from this research project on Github
However, my data is big and so I'd like to take advantage of Spark's distributed capabilities through the MLLIB package. For example, the current setup to run the logit in Spark is:
import pyspark.ml.feature as ft
featuresCreator = ft.VectorAssembler(
inputCols = X_features_list,
outputCol = "features")
glm_binomial = GeneralizedLinearRegression(family="binomial", link="logit", maxIter=25, regParam = 0,
labelCol='df',
weightCol='wgt_panel')
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[featuresCreator, glm_binomial])
model = pipeline.fit(encoded_df_nonan)
The "weightcol" works for just simple sample weights, but I'm wondering if anyone is aware of a method for implementing a more complex weighting scheme in Spark (note that the above would match a proc logistic, not a proc surveylogistic). For comparison, the method used to calculate the covariance matrix in the surveylogistic is here.

Related

Use own dataset in Denoising Diffusion Implicit Models

Can you tell me how I create and implement my own dataset in this code?
def prepare_dataset(split):
# the validation dataset is shuffled as well, because data order matters
# for the KID estimation
return (
tfds.load(dataset_name, split=split, shuffle_files=True)
.map(preprocess_image, num_parallel_calls=tf.data.AUTOTUNE)
.cache()
.repeat(dataset_repetitions)
.shuffle(10 * batch_size)
.batch(batch_size, drop_remainder=True)
.prefetch(buffer_size=tf.data.AUTOTUNE)
)
# load dataset
train_dataset = prepare_dataset("train[:80%]+validation[:80%]+test[:80%]")
val_dataset = prepare_dataset("train[80%:]+validation[80%:]+test[80%:]")
It always redirects me to use the TensorFlow datasets. It's from the Keras Notebook 'Denoising Diffusion Implicit Models'.
I tried following the keras guide on how to create your own dataset, but I'm not sure how to implement it in this code. I also tried this way of creating your own dataset, but I don't know if it's possible to implement it into the code above.

How to model tf-idf spark

I'm trying to re-write a code wrote (that it's in Python), but now in spark.
#pandas
tfidf = TfidfVectorizer()
df_final = np.array(tfidf.fit_transform(df['sentence']).todense())
I read on spark documentation, is it necessary to use Tokenizer, HashingTF and then IDF to model tf-idf in PySpark?
#pyspark
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
tokenizer = Tokenizer(inputCol = "sentence", outputCol = "words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol = "words", outputCol="rawFeatures", numFeatures = 20)
tf = hashingTF.transform(wordsData)
idf = IDF(inputCol = "rawFeatures", outputCol = "features")
tf_idf = idf.fit(tf)
df_final = tf_idf.transform(tf)
I'm not sure if you understand clearly how tf-idf model works, since tokenizing is essential and fundamental for tf-idf model no matter in sklearn or spark.ml version. You post actually cover 2 questions:
Why tf-idf need to tokenization the sentence: I won't copy the mathematical equation since it's easy to search in google. Long in short, tf-idf is a statistical measurement to evaluate the relevancy and relationship between a word to a document in a collection of documents, which is calculated by the how frequent a word appear in a document (tf) and the inverse frequency of the word across a set of documents (idf). Therefore, as the essence is the vocabulary and all calculation are based on vocaulary, if your input is sentence like what you mentioned in your sklearn version, you must do the tokenizing of the sentence before the calculation, otherwise the whole methodology is not valid anymore.
How tf-idf work in sklearn: If you understand how tf-idf works, then you should understand the different steps in the example of spark official document are essential. Thanks for the sklearn developer to create such convenient API, you can use the .fit_transform() directly with the Series of sentence. In fact, if you check the source code of the TfidfVectorizer in sklearn, you can see that it actually did the "tokenization", just in a different way:
It inherits the from the CountVectorizer (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1717)
It uses the ._count_vocab() method in CountVectorizer to transform your sentence. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1338)
In ._count_vocab(), it checks each sentences and create the sparse matrix to store the frequency of each vocabulary in each sentences before the tf-idf calculation. (https://github.com/scikit-learn/scikit-learn/blob/36958fb240fbe435673a9e3c52e769f01f36bec0/sklearn/feature_extraction/text.py#L1192)
To conclude, tokenizing the sentence is essential for the tf-idf model calculation, the example in spark official documents is efficient enough for your model building. Remember to use the function or method if spark provide such API and DON'T try to build the user defined function/class to achieve the same goal, otherwise it may reduce your computing performance or trigger other issue like out-of-memory.

Applying k-folds (stratified 10-fold) to my text classification model

I need help in the following, I have a data frame with Columns: Class (0,1) and text.
After cleansing (lemmatizing, removing stopwords, etc), I split the data like the following:
#splitting datset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset['text_lemm_nostop'], dataset['class'],test_size=0.3, random_state=50)
Then I used n-gram:
#using n-gram
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(min_df=5,ngram_range=(2,2), max_features=5000).fit(X_train)
print('No. of features:')
len(vect.get_feature_names_out()) # how many features
Then I did the vectorizing:
X_train_vectorized=vect.transform(X_train)
X_test_vectorized=vect.transform(X_test)
Then I started applying ML algorithms like (logistic regression, Naive Bayes, RF,..etc)
and I will share only the logistic regression
#Logistic regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train_vectorized, y_train)
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics
predictions=model.predict(vect.transform(X_test))
print("AUC Score is: ", roc_auc_score(y_test, predictions),'\n')
print("Accuracy:",metrics.accuracy_score(y_test, predictions),'\n')
print('Classification Report:\n',classification_report(y_test,predictions))
My Questions:
1. Is what I am doing is fine in case I am going with normal splitting (30% test)?
(I feel having issues with n-gram code!)
2. If I want to engage the K-fold cross-validation (ie. Stratified 10-fold), how could I do that in my code?
Appreciate any help and support!!
A few comments:
Generally I don't see any obvious mistake, seems fine to me.
Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2) for instance. It's probably worth trying unigrams only, at least to compare the performance.
To perform 10-fold cross-validation, I'd suggest you use KFold because it's the simplest method. In this case it's almost the same code, except that you insert it inside a loop which iterates the 10 folds. Every time you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set. Both training and test set are obtained by selecting the corresponding indexes indexes returned by KFold.

how to get the prediction of a model in pyspark

i have developed a clustering model using pyspark and i want to just predict the class of one vector and here is the code
spark = SparkSession.builder.config("spark.sql.warehouse.dir",
"file:///C:/temp").appName("Kmeans").getOrCreate()
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
say that i have a vector of features V and i want to predict in which class it belongs
i tried a method that i found in this link http://web.cs.ucla.edu/~zhoudiyu/tutorial/
but it doesn't work since i'm working with SparkSession not in sparkContext
I see that you dealt with the most basic steps in your model creation, what you still need is to apply your k-means model on the vector that you want to make the clustering on (like what you did in line 10) then get your prediction, I mean what you have to do is to reDo the same work done in line 10 but on the new vector of features V. To understand this more I invite you to read this posted answer in StackOveflow:
KMeans clustering in PySpark.
I want to add also that the problem in the example that you are following is not due to the use of SparkSession or SparkContext as those are just an entry point to the Spark APIs, you can also get access to a sparContext through a sparkSession since it is unified by Databricks since Spark 2.0. The pyspark k-means is like the Scikit learn the only difference is the predefined functions in spark python API (PySpark).
You can call the predict method of the kmeans model using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1,0]))
Here [1,0] is just an example. It should have the same length as your feature vector.

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models like e.g. VGG16.
In order to get started, I ran the example of the [Keras documentation site ][https://keras.io/applications/] for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The used preprocess_input() function bothers me
(the function does Zero-centering by mean pixel what can be seen by looking at the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
a)
If yes, one can conclude that you always have to be aware of what preprocessing steps have been performed during training phase?!
b)
If no: Does preprocessing of validation/test data cause a bias?
I appreciate your help.
Yes you should use the preprocessing step. You can retrain the model without it but the first layers will learn to center your datas so this is a waste of parameters.
If you do not recenter your performances will suffer.
Great thread on reddit : https://www.reddit.com/r/MachineLearning/comments/3q7pjc/why_is_removing_the_mean_pixel_value_from_each/