Horovod Timeline and MPI Tracing in Azure Machine Learning Workspace (MPI Configuration) - distributed-computing

All,
I am trying to train a distributed model using Horovod on Azure Machine Learning Service as shown below.
estimator = TensorFlow(source_directory=script_folder,
                       entry_script='train_script.py',
                       script_params=script_params,
                       compute_target=compute_target_gpu_4,
                       conda_packages=['scikit-learn'],
                       node_count=2,
                       distributed_training=MpiConfiguration(),
                       framework_version='1.13',
                       use_gpu=True)
run = exp.submit(estimator)
How do I enable the Horovod timeline?
How do I enable more detailed MPI tracing to see the communication between the nodes?
Thanks.

The following sample uses the TensorFlow estimator class in the SDK, with distributed_training set to MpiConfiguration().
Here is another sample that uses Horovod to train a GenSen sentence similarity model:
https://github.com/microsoft/nlp-recipes/blob/46c0658b79208763e97ae3171e9728560fe37171/examples/sentence_similarity/gensen_train.py
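To answer the timeline question directly: Horovod reads the HOROVOD_TIMELINE environment variable and, when it is set, writes a Chrome-trace JSON to that path. Below is a minimal sketch of the variables involved; note that I_MPI_DEBUG applies only if the cluster runs Intel MPI, and whether your SDK version accepts an environment_variables argument on the estimator is an assumption worth checking.

```python
# Environment variables that enable the Horovod timeline and (on Intel MPI)
# more verbose MPI tracing. Writing under ./outputs makes the file appear
# in the run's artifacts in Azure ML.
timeline_env = {
    "HOROVOD_TIMELINE": "./outputs/horovod_timeline.json",  # Chrome-trace JSON; open via chrome://tracing
    "HOROVOD_TIMELINE_MARK_CYCLES": "1",                    # optionally mark tensor-fusion cycles
    "I_MPI_DEBUG": "6",                                     # Intel MPI only: per-rank communication debug output
}
print(sorted(timeline_env))
```

If your estimator supports it, pass environment_variables=timeline_env when constructing it; otherwise you can set these with os.environ at the top of train_script.py, before Horovod is initialized.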

Related

How to save models in PyTorch from TPU to CPU

I am training a neural network model with PyTorch.
Since this model is very complicated, I made use of the pytorch_xla package to use a TPU. I finished training the model and now I want to save the weights so I will be able to use them from any environment.
I tried to save the data like so:
file_name = "model_params"
torch.save(model.state_dict(), file_name)
and when I tried to load them (from an environment which does not support TPUs):
model.load_state_dict(torch.load(file_name))
I got the following error:
NotImplementedError: Could not run 'aten::empty_strided' with arguments from the 'XLA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [CPU, Meta, BackendSelect, Named, ADInplaceOrView, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, UNKNOWN_TENSOR_TYPE_ID, AutogradMLC, AutogradHPU, AutogradNestedTensor, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
Is there a way to do what I want?
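A common workaround is to copy every tensor in the state dict to CPU before saving, so the checkpoint no longer references the XLA backend; torch_xla also provides xm.save(), which performs this move for you. A minimal sketch, with a small CPU module standing in for the TPU-trained model:

```python
import torch
import torch.nn as nn

# Stand-in for the trained model; on a TPU its parameters would live on an
# XLA device, which is exactly what the .cpu() copy below removes.
model = nn.Linear(4, 2)

# Move every tensor in the state dict to CPU before serializing.
cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}
torch.save(cpu_state_dict, "model_params.pt")

# On a machine without torch_xla, load with an explicit CPU mapping.
reloaded = torch.load("model_params.pt", map_location="cpu")
print(sorted(reloaded))  # ['bias', 'weight']
```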

How do I profile my code in GPflow2? What happened to dump_timeline and gpflowrc?

I am trying to profile my code in GPflow 2, as I need to know which part of my code consumes the most CPU time. In GPflow 1 there was a gpflowrc file where you could set dump_timeline = True, but this has changed in GPflow 2 (to the gpflow.config module) and I can't find a similar option in there.
Working with TensorFlow 2 is much simpler than TensorFlow 1, so with GPflow 2 we're relying a lot more on TensorFlow built-ins instead of adding extra code - GPflow 2 is "just another TensorFlow graph". So you should be able to directly use the TensorFlow profiler: see this blog post for an introduction and the guide in the TensorFlow documentation for more details.
(According to https://github.com/tensorflow/tensorboard/issues/2874, TensorBoard's "Trace Viewer" for the timeline should now work fine in Firefox, but if you encounter any issues on the visualization side it's worth trying out Chrome.)
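As a minimal sketch of the TensorFlow profiler mentioned above (assuming TensorFlow >= 2.2, where the tf.profiler.experimental API is available; the toy objective here stands in for e.g. a GPflow log-marginal-likelihood evaluation):

```python
import tensorflow as tf

logdir = "/tmp/tf_profile_demo"  # traces land here; inspect in TensorBoard's Profile tab
tf.profiler.experimental.start(logdir)

@tf.function
def objective(x):
    # stand-in for the computation you actually want to profile
    return tf.reduce_sum(tf.square(x))

value = objective(tf.ones([1000]))
tf.profiler.experimental.stop()
print(float(value))  # 1000.0
```

Point TensorBoard at logdir afterwards and open the Profile tab to see the per-op timeline.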

OSIsoft PI Data Archive - How to create a PIServer collective

I am writing a C# application that uses the OSIsoft AFSDK DLL (2016 version). My application should define a PI server collective and send data to this collective.
What I'm trying to figure out is:
using C# code, how do I define PI Data Archive servers as a collective, and how do I configure one of them to be the primary server?
I've been searching for code examples and couldn't find any.
Can anyone post an example or a reference to such an example?
You can add a new collective to the Known Servers Table using the following code snippet:
using OSIsoft.AF.PI;
PIServers myPIServers = new PIServers();
PIServer newPIServer = myPIServers.Add("NewPIServer");
Then you can access to PI Collective information using:
PICollective collective = newPIServer.Collective;
You can consult the Live Library:
https://techsupport.osisoft.com/Documentation/PI-AF-SDK/Html/T_OSIsoft_AF_PI_PIServers.htm
And also there's a very good forum for anything regarding OSIsoft PI: https://pisquare.osisoft.com/
Hope it helps!

CNTK TransferLearning.model

I've been following the Build your own image classifier using Transfer Learning tutorial found at this link.
At the end of the tutorial, it says to use your own data set, which I created by following the example in the tutorial. It successfully completed and created the folder ~/CNTK-Samples-2-3-1/Examples/Image/TransferLearning/Output, which is expected. Inside this output folder are predictions.txt, predOutput.txt, and TransferLearning.model.
My question is, how do I access TransferLearning.model? I'm not sure what to do with it and I can't find anything in the documentation allowing me to alter it or run it.
Once you have successfully created your model, you can use it in a Python environment (with CNTK) to classify your images. The code should look something like this (eval_single_image is a helper defined in the tutorial's TransferLearning.py script):
from cntk import load_model

# load the trained transfer learning model
trained_model = load_model(model_file)

# for every new image: get predictions for a single image
probs = eval_single_image(trained_model, img_file, image_width, image_height)

How to train a Natural Language Classifier using Fluent

I'm using the Fluent library to fire a request to the Natural Language Classifier service so as to 'train' the data.
The documentation says the following parameters are to be passed:
name=training_data; type=file; description=training data
name=training_meta_data; type=file; description=metadata to identify language etc.
Below is my code sample:
File trainingCSVFile = new File("path to training file");
Request request = Request.Post(<bluemix service url>)
    .bodyFile(trainingCSVFile, ContentType.TEXT_PLAIN)
    .bodyString("{\"language\":\"en\",\"name\":\"PaymentDataClassifier\"}", ContentType.APPLICATION_JSON);
However, I am getting an internal server error, which is plausibly due to my request format. Can anyone help me with how to pass the above-mentioned parameters using the Fluent library?
I'm going to assume that you are using Java, and suggest you use the Java SDK. You can find examples for not only the Natural Language Classifier but all the Watson services plus Alchemy services.
Installation
Download the jar
or use Maven
<dependency>
  <groupId>com.ibm.watson.developer_cloud</groupId>
  <artifactId>java-sdk</artifactId>
  <version>2.10.0</version>
</dependency>
or use Gradle
'com.ibm.watson.developer_cloud:java-sdk:2.10.0'
The code snippet to create a classifier is:
NaturalLanguageClassifier service = new NaturalLanguageClassifier();
service.setUsernameAndPassword("<username>", "<password>");
File trainingData = new File("/path/to/csv/file.csv");
Classifier classifier = service.createClassifier("PaymentDataClassifier", "en", trainingData);
System.out.println(classifier);
The training duration will depend on your data but once it's trained you can do:
Classification classification = service.classify(classifier.getId(), "Is it sunny?");
System.out.println(classification);
Feel free to open an issue in the GitHub repo if you have problems.