I'm using a TensorFlow estimator in Azure ML Service with the following config:
from azureml.core.runconfig import TensorflowConfiguration
from azureml.train.dnn import TensorFlow

distributed_training = TensorflowConfiguration()
distributed_training.worker_count = 3

est = TensorFlow(source_directory=script_folder,
                 script_params=script_params,
                 compute_target=compute_target,
                 node_count=4,
                 distributed_training=distributed_training,
                 use_gpu=True,
                 entry_script=train_script)

run = exp.submit(est)
It seems that with this configuration the individual workers each come up with their own instance of the trained model and try to register the model multiple times. Is distributed training something I need to handle in the TensorFlow training script?
You'd need to handle model saving in the training script: gather the model coefficients to a single rank, and save them to a file from that rank.
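As a hedged sketch of that idea (not the only way to do it), the training script can inspect the TF_CONFIG environment variable set by the distributed backend and let only one task write the final model; the toy Keras model and the outputs path below are illustrative placeholders:

import json
import os
import tensorflow as tf

# Work out which task this process is from TF_CONFIG (set by the backend).
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})
task_type = task.get("type", "chief")
task_index = task.get("index", 0)

# Treat the chief/master task (or, failing that, worker 0) as the single saver.
# Adjust this test to match how your distribution strategy labels tasks.
cluster = tf_config.get("cluster", {})
has_chief = "chief" in cluster or "master" in cluster
is_chief = (task_type in ("chief", "master")
            or (not has_chief and task_type == "worker" and task_index == 0))

# ... build and train the model on every worker; a toy model stands in here ...
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

if is_chief:
    # Files written under ./outputs are uploaded to the run history by Azure ML,
    # so only one copy of the model ends up attached to the run.
    os.makedirs("outputs", exist_ok=True)
    model.save(os.path.join("outputs", "model"))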
You can then register the model outside the training script, using run.register_model and passing the folder/file of the saved model as an argument.
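For example, from the notebook or driver code that submitted the run; the model name here is an arbitrary example, and 'outputs/model' is assumed to match the path the chief worker saved to:

# 'run' is the Run object returned by exp.submit(est) above.
run.wait_for_completion(show_output=True)

# Register the single saved model with the workspace; name and path are illustrative.
registered_model = run.register_model(model_name="tf-distributed-model",
                                      model_path="outputs/model")
print(registered_model.name, registered_model.version)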
I'm working on bark-beetle outbreaks in Alpine forests. To do so, I work with UAV-acquired orthophotos with multispectral bands (RGB, NIR, RE). I want to carry out a supervised classification of a raster (VRT), based on field-acquired ROIs.
I successfully did this using the SAGA GUI. I'm now trying to repeat the same process in QGIS, since I have all my working layers organized in a project. I want to get the same supervised classification with the built-in SAGA extension, but here (unlike in the SAGA GUI) the algorithm asks for a mandatory "Load Statistics from File" parameter.
How do I set this parameter?
From the SAGA documentation, I gathered that it should be the path to a statistics file (about the raster to classify?), but no further information is given about the content of this stats file. I don't know how to create it, nor whether there is a way to create it with QGIS or the SAGA GUI.
Nor did I find any help on this in the SAGA documentation or anywhere else on the internet.
At the start of the experiment, I get the error message 'Error in the model during iteration 5' (the number varies between 2, 3, and 5), and it points to the agent-statistics dataset from main that I try to add to a histogram dataset in my experiment. Is it possible that somewhere I try to collect agent statistics before the agents are initialized?
To be more precise about my setup: the root agent for the experiment is an agent called 'firms'. Within that agent, I have added the dataset 'DSUsers', which simply collects the agent statistics (item.inState(Firm....)) located in the main agent. The dataset DSUsers is what I try to add to the histogram dataset 'data' for the parameter variation experiment, with the settings described below. (Unfortunately, I can't add screenshots yet; I am too new a member.)
In the experiment, I use the following setup:
Before simulation run:
data.reset();
After simulation run:
data.add(root.DSUsers);
DSUsers being the dataset that collects the agent statistics from main. The top-level agent of my experiment is not the main agent (could that be a problem?).
So the process looks something like this:
Collect agent statistics in main -> use a dataset at the agent level that collects and stores those statistics from main -> import this dataset into a histogram dataset in the experiment.
I get a large error message in the console, with the first error pointing to:
private double _DSPledgeUsers_YValue() {
    return get_Main().firms.NPledgeUsers();
}
I found the culprit, although I am not sure why this fixes it:
Somehow the dataset I created in the agent caused all of the agents' states in the statechart to update automatically at every time step. This was a problem in my model because some states were supposed to update only occasionally. So I simply moved the dataset to main.
I do not want to speculate, but if you encounter the same problem, check the location of your datasets!
I saved a linear regression model locally, which I want to use in later applications.
How can I import and call a saved model which was not created in the current application?
I'm using Scala on IntelliJ.
This is how I saved the model:
LRmodel.write.save(ORCpath+"LinearRegModel")
To be able to load the model, you need to have defined the same model beforehand, so it is not straightforward to load the model in a new environment.
I.e. you cannot load a model that contains 4 nodes into a model that contains 10.
As you can see here: https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
the method to load is:
import org.apache.spark.mllib.classification.LogisticRegressionModel

val sameModel = LogisticRegressionModel.load(
  sc,
  "target/tmp/scalaLogisticRegressionWithLBFGSModel"
)
A different application isn't a problem. You can simply load the model from the save path, as per #Wonay's instruction.
The issue is when you move to another file system, say from local to Hadoop... or just to another local PC. In that case, frankly, it is best to just regenerate the model on your new filesystem.
I'm learning Scala and am trying, without success, to load a model that I have run/fit on data. It took 6 hours to run, and I'm afraid I'll have to rerun it if I can't figure out a way to save/load the output.
I ran the KMeans as part of a pipeline. I saved the output of the pipeline that I ran on my training dataset as 'model' and that was what I tried to load.
After running the model I was able to save it (or at least I thought I did) using:
model.write.overwrite().save("/analytics_shared/qoe/km_model")
My question is how do I load it back so I can then use it to cluster/score new data without having to rerun the training that takes 6 hours.
You should use the KMeansModel.load method.
load(path: String): KMeansModel Reads an ML instance from the input path, a shortcut of read.load(path).
In your case, it'd be as follows:
import org.apache.spark.ml.clustering.KMeansModel
val model = KMeansModel.load("/analytics_shared/qoe/km_model")
I ran the KMeans as part of a pipeline. I saved the output of the pipeline that I ran on my training dataset as 'model' and that was what I tried to load.
With ML Pipeline it's even easier as you simply replace KMeansModel with PipelineModel.
import org.apache.spark.ml.PipelineModel
val model = PipelineModel.load("/analytics_shared/qoe/km_model")
Recently I got access to run my code on a cluster. My code is totally parallelizable, but I don't know how best to exploit its parallel nature. I have to compute the elements of a big matrix, and each element is independent of the others. I want to submit the job to run on several machines (say 100) to speed up the computation of the matrix.
Right now, I have a script that submits multiple jobs, each responsible for computing a part of the matrix and saving it in a .mat file; at the end I merge them to get the whole matrix. To submit each individual job, I create a new .m file (run1.m, run2.m, ...) that sets a variable and then runs the function to compute the associated part of the matrix. So basically run1.m is
id=1;compute_dists_matrix
and then compute_dists_matrix uses id to find the part it is going to compute. I then wrote a script to create run1.m through run60.m and qsub them to the cluster.
I wonder if there is a better way to do this using some MATLAB features for example. Because this seems to be a very typical task.
Yes, it works, but it is not ideal, and as you say this is a common problem. MATLAB has the Parallel Computing Toolbox.
Does your cluster have it? If so, distributed arrays are worth having a look at. If you don't have access to it, then what you are doing is the only other way. You can wrap your run1.m, run2.m, ... in a controlling script to automate it for you.
I believe you could use command-line arguments for the id and submit jobs with a range of values for this id. Command-line arguments can be handled by launching MATLAB from the command line without the IDE, providing the name of the script to be executed and the list of arguments. I would think you can set up dependencies in your job manager and create a "reduce" script to merge the partial results (from files). The whole process could be managed from a single script that generates the id and the other necessary arguments, and submits the processing and post-processing jobs with dependencies.