How to make sure we have the same number of columns in train and production data in PySpark?

I have trained a PySpark model and saved the pipeline and model to storage.
Now I am trying to predict on future datasets by loading the pipeline.
pipelineModel = PipelineModel.load("s3://data-production/pipelineModel_v1")
new_test = pipelineModel.transform(new_df1)
Here I am getting an error on my production dataset.
How do I make sure the columns in the training dataset are replicated in the production data as well in PySpark?
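
One way to guard against this is to persist the training schema and align the production frame to it before calling transform. A minimal sketch, where train_schema is a hypothetical dict of column names to types saved at training time:

from pyspark.sql import functions as F

# Hypothetical schema captured when the pipeline was fitted, e.g.
# {c.name: c.dataType.simpleString() for c in train_df.schema}.
train_schema = {"age": "int", "income": "double", "country": "string"}

# Add any columns missing from production as typed nulls, then select
# the training columns in the training order.
for col_name, col_type in train_schema.items():
    if col_name not in new_df1.columns:
        new_df1 = new_df1.withColumn(col_name, F.lit(None).cast(col_type))
new_df1 = new_df1.select(*train_schema.keys())

new_test = pipelineModel.transform(new_df1)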

Related

Incremental training of ALS model

I'm trying to find out if it is possible to do incremental training of an ALS model on Kinesis streaming data using MLlib in Apache Spark.
I have real-time user interactions coming from a Kinesis stream, but to get updated prediction results I need to retrain the model on the whole data, which takes some time.
I'm trying to figure out if I can do incremental training of the ALS model on streaming data, but cannot find an answer.
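
As far as I know, Spark's ALS does not expose a true incremental update; the usual workaround is to retrain periodically on the accumulated interactions (historic plus newly streamed). A minimal sketch, where interactions_df is a hypothetical DataFrame of all ratings collected so far:

from pyspark.ml.recommendation import ALS

# Full retrain on everything collected so far (historic + streamed events).
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(interactions_df)

# Refresh the serving layer with new top-10 recommendations per user.
recs = model.recommendForAllUsers(10)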

Bizarre results with classification models

I am running a few classification models like logistic regression and CatBoost. I have held out part of the training set as unseen data.
When I evaluate logistic regression on both the train and the unseen data, all the metrics (accuracy, AUC, F1, recall) come out greater than 0.90. As it's a class-imbalance problem I have balanced the classes using SMOTE, and I have used z-scores to normalise all variables.
While the model performs well on the train, unseen, and test data, when I actually run it on the (unlabelled) set of data I want to predict, the model gives me only ten 1s and the remaining 150k are 0s.
Could there really be an issue with my model, or is it indeed just the way the data is?
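
One common cause of this pattern is applying SMOTE (and scaling) before splitting, which leaks resampled points into the evaluation sets and inflates every metric. Resampling should happen only inside the training folds; a minimal sketch with imbalanced-learn, where X and y are hypothetical feature and label arrays:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The sampler runs only when fitting on the training folds, so the
# validation folds keep the real (imbalanced) class distribution.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")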

Using clustering classification as regression feature?

I am attempting to use KMeans clustering to create a feature for an XGBoost regression. The problem is, I am not sure if there is data leakage. The data has a date, so right now I am clustering on the first 70% of the data sorted by date and using that same 70% as my training set.
The clustering includes my target variable. Using the cluster as a feature provides a huge boost to test scores, so I worry that this is causing data leakage. However, the cluster assignments used for the test scores are computed on unseen data in the test set.
Is this valid, or is it causing data leakage? Thank you
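
Including the target in the clustering does leak label information into the feature. A safer setup is to fit the clusters on the training features only (target excluded) and then assign test rows to those fixed clusters. A minimal sketch with scikit-learn, where X_train and X_test are hypothetical pandas feature frames without the target column:

from sklearn.cluster import KMeans

# Fit clusters on training features only; the target is excluded.
km = KMeans(n_clusters=8, n_init=10, random_state=42)
train_clusters = km.fit_predict(X_train)

# Assign test rows to the clusters learned on the training data.
test_clusters = km.predict(X_test)

# Add the cluster id as a feature for the XGBoost regressor.
X_train = X_train.assign(cluster=train_clusters)
X_test = X_test.assign(cluster=test_clusters)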

How to project PCA features from train data to test data in spark scala?

I read this link, which explains it: Anomaly detection with PCA in Spark
But what's the code to extract PCA features from the training data and project them onto the test data?
From what I understood, we have to apply the same transformation fitted on the training data to the test data.
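
The question asks for Scala, but the idea is the same in either API: fit the PCA model on the training data only and reuse it to transform the test data. A minimal PySpark sketch (the Scala calls are analogous), with hypothetical feature columns f1, f2, f3:

from pyspark.ml.feature import PCA, VectorAssembler

# Assemble the raw columns into a single feature vector.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_vec = assembler.transform(train_df)
test_vec = assembler.transform(test_df)

# Fit PCA on the training data only...
pca_model = PCA(k=2, inputCol="features", outputCol="pcaFeatures").fit(train_vec)

# ...and project both sets with the same principal axes.
train_pca = pca_model.transform(train_vec)
test_pca = pca_model.transform(test_vec)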

Orange3 how to reapply preprocessing to new data

Is there any way to reapply preprocessing done on a training dataset to a new dataset of experimental data, so that the transformed data can be fed to the already trained classifier?
The preprocessor modifies the domain of the training data set. If you want to apply the same transformations to the testing (experimental) data, you apparently have to cast it into the same domain, as Orange's built-in predictors seem to do:
from Orange.data import Table

train = preprocess(train)          # the preprocessor rewrites the domain
test = Table(train.domain, test)   # cast the raw test data into that domain
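
Once the experimental data is cast into the training domain, it can be passed straight to the trained model. Assuming model is a hypothetical classifier fitted on the preprocessed train table:

predictions = model(test)   # Orange models are callable on data tables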