How to load saved KMeans model (in ML Pipeline)? - scala

I'm learning Scala and am trying, without success, to load a model that I have fit on my data. The fit took 6 hours to run, and I'm afraid I'll have to rerun it if I can't figure out a way to save and load the output.
I ran the KMeans as part of a pipeline. I saved the output of the pipeline that I ran on my training dataset as 'model' and that was what I tried to load.
After running the model I was able to save it (or at least I thought I had) using:
model.write.overwrite().save("/analytics_shared/qoe/km_model")
My question is how do I load it back so I can then use it to cluster/score new data without having to rerun the training that takes 6 hours.

You should use KMeansModel.load method.
load(path: String): KMeansModel Reads an ML instance from the input path, a shortcut of read.load(path).
In your case, it'd be as follows:
import org.apache.spark.ml.clustering.KMeansModel
val model = KMeansModel.load("/analytics_shared/qoe/km_model")
I ran the KMeans as part of a pipeline. I saved the output of the pipeline that I ran on my training dataset as 'model' and that was what I tried to load.
With ML Pipeline it's even easier as you simply replace KMeansModel with PipelineModel.
import org.apache.spark.ml.PipelineModel
val model = PipelineModel.load("/analytics_shared/qoe/km_model")
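Once loaded, the pipeline model can score new data directly with transform. A minimal sketch, assuming a SparkSession and a hypothetical input path with the same schema the pipeline was trained on:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("score-kmeans").getOrCreate()

// Hypothetical new data; must have the same input columns as the training data
val newData = spark.read.parquet("/analytics_shared/qoe/new_data")

val model = PipelineModel.load("/analytics_shared/qoe/km_model")

// transform() appends the KMeans "prediction" column holding the cluster id
val scored = model.transform(newData)
scored.select("prediction").show(5)
```

No refitting happens here; the learned cluster centers are read back from the saved model directory.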

Related

NullPointerException during Parameter Variation Experiment with agent statistics

At the start of the experiment, I get the error message 'Error in the model during iteration 5' (the iteration number varies between 2, 3, and 5), and it points to the agent-statistics dataset from Main that I try to add to a histogram dataset in my experiment. Is it possible that somewhere I try to collect agent statistics before the agents are initialized?
To be more precise about my setting: my root agent for the experiment is an agent called 'firms'. Within that agent, I have added the dataset 'DSUsers', which simply collects the agent statistics (item.inState(Firm....)) located in the Main agent. The dataset DSUsers is what I try to add to the histogram dataset 'data' for the parameter variation experiment, with the settings described below. (Unfortunately, I can't add screenshots yet; I'm too new a member...)
In the experiment, I use the following setup:
Before simulation run:
data.reset();
After simulation run:
data.add(root.DSUsers);
DSUsers being the dataset that I collect from the main agent-statistics. The top-level agent of my experiment is not the main agent (could that be a problem?).
So the process looks something like this:
Collecting agent statistics in main -> Using a dataset on the agent-level which collects the statistics from main and stores it -> importing this dataset to a histogram dataset in the experiment.
I get a large error message in the console, the first error pointing to
private double _DSPledgeUsers_YValue() {
    return get_Main().firms.NPledgeUsers();
}
Found the culprit, although I'm not sure why this fixes it:
Somehow the dataset I created in the agent caused all of the agents' states in the statechart to update automatically on every time step. This was a problem in my model because some states were supposed to update only occasionally. So I simply moved the dataset to Main.
I do not want to speculate, but if you encounter the same problem, check the location of your datasets!

Distributed Training with Tensorflow in AMLS

I'm using a TensorFlow estimator in Azure ML Service with the following config:
from azureml.core.runconfig import TensorflowConfiguration
from azureml.train.dnn import TensorFlow

distributed_training = TensorflowConfiguration()
distributed_training.worker_count = 3

est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 node_count=4,
                 distributed_training=distributed_training,
                 use_gpu=True,
                 entry_script=train_script)
run = exp.submit(est)
It seems like in the run with this configuration, individual workers come up with their own instances of trained models and try to register the model multiple times. Is Distributed Training something I need to deal with in the Tensorflow training script?
You'd need to handle the model saving in the training script: gather the model coefficients to a single rank, and save them to a file from that rank only, so that the workers don't each produce their own copy.
You can then register the model outside the training script, using run.register_model and passing the folder / file of the saved model as an argument.

How to import saved ML model on Scala?

I saved a linear regression model locally, which I want to use in later applications.
How can I import and call a saved model which was not created in the current application?
I'm using Scala on IntelliJ.
This is how I saved the model:
LRmodel.write.save(ORCpath+"LinearRegModel")
To be able to load the model, you need the same model definition available at load time, so loading the model in a new environment is not completely straightforward.
i.e. you cannot load a model which contains 4 nodes into a model which contains 10.
As shown here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
the method to load is:
val sameModel = LogisticRegressionModel.load(
  sc,
  "target/tmp/scalaLogisticRegressionWithLBFGSModel")
A different application isn't a problem. You can simply load the model from the save path, as per #Wonay's instruction.
The issue is when you move to another file system, say from local to Hadoop...or just to another local PC. In that case, frankly, it is best to just regenerate the model on your new filesystem.
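Since the model in the question was saved with the spark.ml writer (LRmodel.write.save(...)), it can be read back in a new application with the matching companion-object load. A sketch, reusing the ORCpath value from the question:

```scala
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.sql.SparkSession

// A SparkSession must be active before load() is called,
// even though it is not passed in explicitly
val spark = SparkSession.builder().appName("load-lr").getOrCreate()

val sameModel = LinearRegressionModel.load(ORCpath + "LinearRegModel")

// The learned parameters travel with the saved model
println(sameModel.coefficients)
println(sameModel.intercept)
```

Note this is the spark.ml (DataFrame-based) API; the mllib LogisticRegressionModel.load(sc, path) form quoted above is the older RDD-based API.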

Spark - Reload saved Featurization Pipeline vs instantiate new Pipeline with same stages

I would like to check if I'm missing any important points here.
My pipeline is only for Featurization. I understand that once a pipeline that includes an Estimator is fitted, saving the pipeline will persist the params the Estimator has learned, so loading a saved pipeline in this case means not having to re-train the Estimator; which is a huge point.
However, for the case of a pipeline which consists only of Transform stages, would I always get the same result on feature extraction from an input dataset using either of the below two approaches?
1) Creating a pipeline with a certain set of stages, and a configuration per stage; saving and reloading the pipeline; then transforming an input dataset.
versus
2) Each time just instantiating a new pipeline (of course with the exact same set of stages and configuration per stage), then transforming the input dataset.
So, an alternative phrasing would be: as long as the exact set of stages and the configuration per stage are known, a Featurization pipeline can be efficiently recreated (because there is no 'train an estimator' phase) without using save or load?
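The two approaches can be sketched as follows (a minimal illustration with transformer-only stages; the stage choices and column names are assumptions, not from the question):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Approach 2: rebuild the pipeline from the same stage configuration
def buildFeaturizer(): Pipeline = {
  val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setNumFeatures(1024)
  new Pipeline().setStages(Array(tokenizer, hashingTF))
}

// With only transformers, fit() learns nothing from the data, so a
// freshly built pipeline and a saved-and-reloaded one produce the
// same features for the same input:
// val reloaded = PipelineModel.load("/tmp/featurize_pipeline") // approach 1
// val rebuilt  = buildFeaturizer().fit(df)                     // approach 2
// reloaded.transform(df) and rebuilt.transform(df) agree row for row
```

Save/load then mainly buys convenience (one artifact instead of construction code kept in sync), not correctness.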
Thanks,
Brent

How to keep temporary output files in Spark

I'm writing a map-only Spark SQL job which looks like:
val lines = sc.textFile(inputPath)
val df = lines.map { line => ... }.toDF("col0", "col1")
df.write.parquet(output)
As the job takes quite a long time to compute, I would like to save and keep the results of the tasks which successfully terminated, even if the overall job fails or gets killed.
I noticed that, during the computation, in the output directory some temporary files are created.
I inspected them and noticed that, since my job has only a mapper, what is saved there is the output of the successful tasks.
The problem is that the job failed, and I couldn't analyse what it had managed to compute because the temp files were deleted.
Does anyone have some idea how to deal with this situation?
Cheers!
Change the output committer to DirectParquetOutputCommitter:
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
Note that if you've turned on speculative execution, then you have to turn it off to use a direct output committer.
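Both settings together might look like the sketch below (assuming an older Spark 1.x build; DirectParquetOutputCommitter was removed in Spark 2.0, so check availability in your version):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Speculative execution must be off with a direct committer: a
// speculative duplicate task would write straight to the final
// output location and could corrupt or duplicate results.
val conf = new SparkConf()
  .setAppName("map-only-job")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)

// Write task output directly to the destination, skipping the
// temporary directory that normally gets cleaned up on failure
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
```

With this committer, output of tasks that completed successfully remains in the destination directory even if the overall job later fails.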