Hi, I am not able to save a PySpark model to my local folder.
I am using the following code:
pipelinePath = './F:/Model_saved/'
pipeline.write().overwrite().save(pipelinePath)
Any help would be highly appreciated.
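For reference, here is a minimal save/load sketch using a plain Windows path (no leading './'), assuming the mixed relative/absolute path './F:/...' is the culprit; the one-stage Tokenizer pipeline and the 'F:/Model_saved/pipeline' target are placeholders, not the actual pipeline from the question:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer

spark = SparkSession.builder.getOrCreate()

# placeholder one-stage pipeline standing in for the question's pipeline
pipeline = Pipeline(stages=[Tokenizer(inputCol="text", outputCol="words")])

# plain Windows drive path (no leading './'); a 'file:///F:/...' URI also works
pipelinePath = "F:/Model_saved/pipeline"
pipeline.write().overwrite().save(pipelinePath)

# load it back later; use PipelineModel.load(...) if you saved a fitted PipelineModel
restored = Pipeline.load(pipelinePath)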
I'm trying to create a dataset through CloudFormation by joining three datasets (AxB, AxC) to build a base dataset (I can create these through the web GUI), using LogicalTableMap and JoinInstruction in the Dataset template. However, I cannot create the dataset: many deploys fail with errors like "LogicalTableMap must have a single root" or "Circular Dependency" when joining from the base dataset. Are there any suggestions?
Thanks in advance.
I am trying an example provided by networkx here. This example uses a GeoPandas dataset stored in a file named cholera_cases.gpkg.
# read in example data from a geopackage file. Geopackages
# are a format for storing geographic data that is backed
# by sqlite. geopandas reads data relying on the fiona package,
# providing a high-level pandas-style interface to geographic data.
cases = geopandas.read_file("cholera_cases.gpkg")
The example, however, does not mention where or how to obtain this dataset. I combed the GeoPandas website up and down and was unable to locate this file. I want to view the format of its contents and run this example.
If anyone is aware of where to obtain this file, please advise.
If you go to their GitHub, you can find it in their repo. Here: https://github.com/networkx/networkx/tree/main/examples/geospatial
Might be worth cloning the repo to play with the examples.
As general advice, on webpages for projects like these, I always check for links to their GitHub/GitLab/other repositories, because you get to see the project behind the scenes, and a local clone can be kept up to date.
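If you'd rather not clone the whole repository, here is a short sketch of fetching just that file; the raw.githubusercontent.com URL is an assumption derived from the repo path linked above:
import urllib.request
import geopandas

# assumed raw-file URL, derived from the networkx repo path linked above
url = ("https://raw.githubusercontent.com/networkx/networkx/"
       "main/examples/geospatial/cholera_cases.gpkg")
urllib.request.urlretrieve(url, "cholera_cases.gpkg")

# inspect the format: attribute columns plus a geometry column
cases = geopandas.read_file("cholera_cases.gpkg")
print(cases.head())
print(cases.crs)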
I am using a TensorFlow estimator in Azure ML Service with the following configuration:
from azureml.core.runconfig import TensorflowConfiguration
from azureml.train.dnn import TensorFlow
distributed_training = TensorflowConfiguration()
distributed_training.worker_count = 3
est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 node_count=4,
                 distributed_training=distributed_training,
                 use_gpu=True,
                 entry_script=train_script)
run = exp.submit(est)
It seems that, in a run with this configuration, individual workers come up with their own instances of the trained model and try to register the model multiple times. Is distributed training something I need to handle in the TensorFlow training script?
You'd need to handle the model saving in the training script: gather the model coefficients to a single rank, and save them to a file from that rank only.
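As an illustration (not code from this answer), here is a hedged sketch of the kind of guard you would add inside the training script, assuming the TF_CONFIG environment variable that distributed TensorFlow jobs populate, with a placeholder Keras model standing in for the real one:
import json
import os
import tensorflow as tf

# TF_CONFIG is populated for each process in a distributed TensorFlow job;
# the task type/index identify which worker this process is
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})
is_chief = (not task  # default to chief when running non-distributed
            or task.get("type") == "chief"
            or (task.get("type") == "worker" and task.get("index") == 0))

# placeholder model standing in for the actual trained model
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

if is_chief:
    # only one rank writes the model; Azure ML uploads ./outputs after the run
    model.save("./outputs/model")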
You can then register the model outside the training script, using run.register_model and passing the folder or file of the saved model as an argument.
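A short sketch of registering the result afterwards; the model name and the outputs/model path are assumptions, and run is the submitted run from the question's snippet:
# `run` is the run returned by exp.submit(est) in the question's snippet
run.wait_for_completion(show_output=True)

# register the single model written by the chief worker under ./outputs
registered = run.register_model(
    model_name="distributed-tf-model",   # hypothetical model name
    model_path="outputs/model",          # hypothetical path under the run's outputs
)
print(registered.name, registered.version)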
I saved a linear regression model locally, which I want to use in later applications.
How can I import and call a saved model which was not created in the current application?
I'm using Scala on IntelliJ.
This is how I saved the model:
LRmodel.write.save(ORCpath+"LinearRegModel")
To be able to load the model, you need to have defined the same model beforehand, so it is not straightforward to load the model in a new environment.
i.e., you cannot load a model which contains 4 nodes into a model which contains 10.
You can see here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression that the method to load is:
import org.apache.spark.mllib.classification.LogisticRegressionModel

val sameModel = LogisticRegressionModel.load(
  sc,
  "target/tmp/scalaLogisticRegressionWithLBFGSModel"
)
Different application isn't a problem. You can simply load through the save path as per #Wonay's instruction.
The issue is when you move to another file system, say from local to Hadoop, or just to another local PC. In this case, frankly, it is best to just regenerate the model on your new file system.
I'm learning Scala and am trying, without success, to load a model that I have run/fit on data. It took 6 hours to run, and I'm afraid I'm going to have to rerun it if I can't figure out a way to save and load the output.
I ran the KMeans as part of a pipeline. I saved the output of the pipeline that I ran on my training dataset as 'model' and that was what I tried to load.
After running the model I was able to save it (or at least I thought I had) using:
model.write.overwrite().save("/analytics_shared/qoe/km_model")
My question is how do I load it back so I can then use it to cluster/score new data without having to rerun the training that takes 6 hours.
You should use the KMeansModel.load method:
load(path: String): KMeansModel
Reads an ML instance from the input path, a shortcut of read.load(path).
In your case, it'd be as follows:
import org.apache.spark.ml.clustering.KMeansModel
val model = KMeansModel.load("/analytics_shared/qoe/km_model")
I ran the KMeans as part of a pipeline. I saved the output of the pipeline that I ran on my training dataset as 'model' and that was what I tried to load.
With ML Pipeline it's even easier as you simply replace KMeansModel with PipelineModel.
import org.apache.spark.ml.PipelineModel
val model = PipelineModel.load("/analytics_shared/qoe/km_model")