Run TensorFlow code in distributed mode on Google Cloud ML

Does anybody know what changes need to be made to the trainer in order to run a job in distributed mode on Google Cloud ML?
It would be of great help if somebody could share a few articles or docs about this.

By and large, your distributed TensorFlow program will be exactly that -- distributed TensorFlow, with minimal -- or even no -- cloud-specific changes. The best resource for distributed TensorFlow is this tutorial on tensorflow.org. The tutorial walks you through the low-level way of doing things.
There is also a higher-level API, currently in contrib (so the API may change and will move out of contrib in a future version), that reduces the amount of boilerplate code you have to write for distributed training. The official tutorial is here.
Once you've understood the general TensorFlow bits (whether high-level or low-level APIs), there are some specific elements that must be present in your code to get it to run on Cloud ML Engine. In the case of the low-level TensorFlow APIs, you'll need to parse the TF_CONFIG environment variable to set up your ClusterSpec. This is exemplified in this example (see specifically this block of code).
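For a sense of what that parsing involves, here is a minimal sketch, assuming the TF 1.x low-level API and the TF_CONFIG layout that Cloud ML Engine sets; it is not the linked example, just an illustration:
# Minimal sketch of parsing TF_CONFIG on Cloud ML Engine (TF 1.x low-level API).
import json
import os

import tensorflow as tf

tf_config = json.loads(os.environ.get('TF_CONFIG', '{}'))
cluster = tf_config.get('cluster', {})   # e.g. {"master": [...], "ps": [...], "worker": [...]}
task = tf_config.get('task', {})         # e.g. {"type": "worker", "index": 0}

if cluster:
    cluster_spec = tf.train.ClusterSpec(cluster)
    server = tf.train.Server(cluster_spec,
                             job_name=task.get('type'),
                             task_index=task.get('index'))
    if task.get('type') == 'ps':
        server.join()  # parameter servers just serve variables and wait
    # The master/workers go on to build the graph, typically placing variables with
    # tf.device(tf.train.replica_device_setter(cluster=cluster_spec))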
One advantage of the higher-level APIs is that all of that parsing is already taken care of for you; your code should generally just work. See this example. The important piece is that you will need to use learn_runner.run() (see this line), which will work locally and in the cloud to train your model.
Of course, there are other frameworks as well, e.g., TensorFX.
After you've structured your code appropriately, you simply select an appropriate scale tier with multiple machines when launching your training job (see Chuck Finley's answer for an example).
Hope it helps!

If you have your model constructed with TensorFlow Estimators, the changes you need to make are very minimal. You can basically plug your code into, e.g., this boilerplate code.
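To illustrate how little has to change with Estimators, here is a hedged sketch built on the stable tf.estimator.train_and_evaluate path (this is not the linked boilerplate; the input pipelines and output directory are placeholders):
# Sketch: with tf.estimator, distribution on Cloud ML Engine is driven by TF_CONFIG,
# which train_and_evaluate reads for you.
import tensorflow as tf

def train_input_fn():
    # Placeholder pipeline; swap in your real tf.data input pipeline.
    features = {'x': [[1.0], [2.0], [3.0], [4.0]]}
    labels = [0, 1, 1, 0]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

def eval_input_fn():
    features = {'x': [[1.5], [2.5]]}
    labels = [0, 1]
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(2)

estimator = tf.estimator.LinearClassifier(
    feature_columns=[tf.feature_column.numeric_column('x')],
    model_dir='output')  # on Cloud ML Engine this would be a GCS path

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

# The same call runs locally and, unchanged, across the machines of a multi-worker scale tier.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)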

Is your question answered by the --scale-tier argument in Run Distributed Training in the Cloud?
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.0 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
--scale-tier STANDARD_1 \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--verbose-logging true

Related

How can I visualize a TensorRT network graph?

Our current flow: conversion of tf2 -> onnx -> TensorRT (all of the 16-, 32- and 8-bit options).
Is there an existing tool like https://github.com/lutzroeder/netron (or any other way) to see the output model of TensorRT?
You can now visualize TensorRT engine graphs using https://github.com/NVIDIA/TensorRT/tree/main/tools/experimental/trt-engine-explorer
It also offers other info about the engine.
There is no way to do this, because the TensorRT model (engine) is optimized for certain hardware (a specific NVIDIA GPU architecture). It is already "compiled", similar to how Core ML .mlmodel files are compiled to .mlmodelc, whereas TensorFlow (or TFLite) models are not "compiled" for specific hardware.
From the NVIDIA forums: https://forums.developer.nvidia.com/t/visualizing-tensorrt-engine/69032
I assume you are asking for something analogous to TensorBoard or Netron? Unfortunately, I'm not aware of a way for TRT.
From the creator of Netron (lutzroeder):
Unless this format is documented not sure much can be done here. Netron has a browser-only version so even if a Python or C++ API existed for supporting non-inference scenarios it wouldn't be useful.
You can still, of course, visualize the ONNX model right before you create the TensorRT one. TensorRT just optimizes the model, so I don't expect the graph to be very different.
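For that pre-TensorRT step, here is a minimal sketch using the netron Python package (assuming pip install netron and a model.onnx exported by the tf2 -> onnx step):
# Minimal sketch: inspect the ONNX model that feeds TensorRT.
import netron

netron.start('model.onnx')  # serves the graph viewer locally and opens it in the browser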
Is the goal to visualize it after the compression from onnx/TensorRT? Or just to visualize it in general? You can always just use the Keras tool.
I believe you can use a TensorRT model with Netron as well, based on this YouTube video.

Chaos engineering best practice

I studied the Principles of Chaos and looked for some open-source projects, such as ChaosBlade, which is open-sourced by Alibaba, and Mangle, by VMware.
Both of these are fault-injection tools and do no analysis of the system under test.
According to the Principles of Chaos, we should:
1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
So how do we do step 4? Should we use a monitoring system to watch some major metrics and check the status of the system after fault injection?
Are there any good suggestions or best practices?
So how do we do step 4? Should we use a monitoring system to watch some major metrics and check the status of the system after fault injection?
As always, the answer is: it depends. It depends on how you want to measure your hypothesis, on the hypothesis itself, and on the system. But normally it makes total sense to introduce metrics to improve observability.
If your hypothesis is something like 'Our service can process 120 requests per second, even if one node fails', then yes, you could measure that via metrics, but you could also measure it via the requests you send and the responses you receive back. It is up to you.
But if your hypothesis is 'I get a response for a request that was sent before a node goes down', then it makes more sense to verify this directly with the requests and responses.
On our project we use, for example, chaostoolkit, which lets you specify the hypothesis in JSON or YAML along with the related actions to prove it.
So you can say 'I have a steady state X, and if I do Y, then steady state X should still be valid.' The toolkit is also able to verify metrics if you want it to.
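As an illustration of the request/response style of verification (not tied to chaostoolkit or any specific tool), here is a hypothetical sketch of a steady-state probe; the URL, request count, and threshold are made-up assumptions:
# Hypothetical steady-state probe: send N requests, report success rate and mean latency.
# Run it before the injection (control) and during the injection (experiment), then compare.
import time

import requests  # third-party HTTP client: pip install requests

SERVICE_URL = 'http://my-service.example.com/health'  # hypothetical endpoint

def steady_state(n_requests=120, timeout_s=1.0):
    ok = 0
    latencies = []
    for _ in range(n_requests):
        start = time.time()
        try:
            if requests.get(SERVICE_URL, timeout=timeout_s).status_code == 200:
                ok += 1
        except requests.RequestException:
            pass
        latencies.append(time.time() - start)
    return ok / n_requests, sum(latencies) / len(latencies)

baseline = steady_state()          # control measurement
# ... trigger the fault injection here (chaosblade, mangle, a chaostoolkit action, ...) ...
under_injection = steady_state()   # experimental measurement

# The hypothesis holds if the steady state barely moves, e.g. success rate stays above 0.99.
print('baseline:', baseline, 'under injection:', under_injection)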
The Principles of Chaos sit a bit above the actual testing: they reflect the philosophy of the designed vs. the actual system, and the system under injection vs. a baseline, but they are a bit too abstract to apply in everyday testing. They are a way of reasoning, not a work-process methodology.
I think the control group vs. experimental group wording is an especially doubtful part: you stage a test (injection) in a controlled environment and try to catch whether there is a user-facing incident, an SLA breach of any kind, or a degradation. I do not see where the control group is if you test on a test stand or a dedicated environment.
We use a very linear variety of chaos methodology which is:
find failure points in the system (based on architecture, critical user scenarios and history of incidents)
design chaos test scenarios (may be a single attack or a more elaborate sequence)
run tests, register results and reuse green for new releases
start tasks to fix red tests, verify the solutions when they are available
One may say we are actually using the Principles of Chaos in 1 and 2, but we tend to think of chaos testing as a quite linear and simple process.
Mangle 3.0 was released with an option for analysis using a resiliency score. Detailed documentation is available at https://github.com/vmware/mangle/blob/master/docs/sre-developers-and-users/resiliency-score.md

How can I mock DynamoDB access via Spark in Scala?

I have a Spark job written in Scala that ultimately writes out to AWS DynamoDB. I want to write some unit tests around it, but the only problem is I don't have a clue how to go about mocking the bit that writes to DynamoDB. I'm making use of their emr-dynamodb-connector class, which means I'm not using any dependency injection (otherwise this would be easy).
After I read in some RDD data using Spark, I do some simple transforms on it into a Pair RDD of type (org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable). So my code's only brush with Dynamo is creating DynamoDBItemWritable objects. That class doesn't inherently contain any logic to utilize the AWS SDK to save anything; it's essentially just a data object. My code then calls this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

val conf = new Configuration()
conf.set("dynamodb.servicename", "dynamodb")
conf.set("dynamodb.input.tableName", "MyOutputTable")
conf.set("dynamodb.output.tableName", "MyInputTable")
conf.set("dynamodb.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
conf.set("dynamodb.regionid", "us-east-1")
conf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
conf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
myTransformedRdd.saveAsHadoopDataset(new JobConf(conf))
...and the connector magically registers the right classes and makes the right calls so that it effectively saves the results to DynamoDB accordingly.
I can't mock SparkSession because it has a private constructor (that would be extremely messy anyway). And I don't have any direct way, as far as I know, to mock the DynamoDB client. Is there some magic syntax in Scala (or Scalatest, or Scalamock) to allow me to tell it that if it ever wants to instantiate a Dynamo client class, that it should use a mocked version instead?
If not, how would I go about testing this code? I suppose theoretically, perhaps there's a way to set up a local, in-memory instance of Dynamo and then change the value of dynamodb.endpoint but that sounds horribly messy just to get a unit test working. Plus I'm not sure it's possible anyway.
Take a look at LocalStack. It provides an easy-to-use test/mocking framework for developing AWS-related applications by spinning up AWS-compatible APIs on your local machine or in Docker. It supports two dozen AWS APIs, and DynamoDB is among them. It is a really great tool for functional testing without using a separate environment in AWS.
If you need only DynamoDB there is another tool: DynamoDB Local, a Docker image with Amazon DynamoDB onboard.
Both are as simple as starting a Docker container:
docker run -p 8000:8000 amazon/dynamodb-local
docker run -P localstack/localstack
And if you're using JUnit 5 for the tests, let me recommend JUnit 5 extensions for AWS, a few JUnit 5 extensions that can be useful for testing AWS-related code. These extensions can be used to inject clients for AWS services provided by tools like LocalStack (or the real ones). Both AWS Java SDK v2.x and v1.x are supported.

Google Cloud NL API term/classification quality and batch processing on Traditional Chinese (zh-Hant) data

I've been testing Google Cloud NL API v1.0 recently. I mainly use Traditional Chinese (a.k.a. zh-Hant) data. After the testing, I find the quality is not satisfactory: the classification is not right, there are too many one-character terms (many of them should be stop words), and the worst quality is for unknown-word recognition.
Also, some analysis methods (e.g. entity sentiment) don't support zh-Hant, so I can only use 'en' to run zh-Hant data, which is a pity.
Does anyone know whether the NL API provides any way, e.g. setting configuration, setting parameters, or running some process, to improve the results?
Does anyone actually have experience using NL API results to add a value-added feature to a business product or service?
Also, if I want to feed it high-volume data, is there a library or SDK that I can use to write code that carries out batch-in, batch-out processing?

How to tag a scientific data processing tool to ensure repeatability

We develop a data processing tool to extract scientific results out of a given set of raw data. In data science it is very important that you can re-obtain your results and repeat the calculations that led to a result set.
Since the tool is evolving, we need a way to find out which revision/build of our tool generated a given result set and how to find the corresponding source from which the tool was built.
The tool is written in C++ and Python, gluing together the C++ parts using Boost::Python. We use CMake as a build system generating Makefiles for Linux. Currently the project is stored in a Subversion repo, but some of us already use git or hg, and we are planning to migrate the whole project to one of them in the very near future.
What are the best practices in a scenario like this to get a unique mapping between source code, binary and result set?
Ideas we are already discussing:
Somehow injecting the global revision number (see the sketch after this list)
Using a build number generator
Storing the whole source code inside the executable itself
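To make the first idea concrete, here is a hedged sketch of capturing the VCS revision at build time on the Python side; the git/svnversion commands are standard, but the generated _version.py module name is made up for illustration:
# Hypothetical build step: capture the VCS revision and bake it into a module
# that the tool imports and writes into every result set it produces.
import subprocess

def current_revision():
    for cmd in (['git', 'describe', '--always', '--dirty'], ['svnversion']):
        try:
            return subprocess.check_output(cmd).decode().strip()
        except (OSError, subprocess.CalledProcessError):
            continue
    return 'unknown'

with open('_version.py', 'w') as f:  # hypothetical generated module
    f.write('REVISION = %r\n' % current_revision())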
This is a problem I spend a fair amount of time working on. To what @VonC has already written, let me add a few thoughts.
I think that the topic of software configuration management is well understood and often carefully practiced in commercial environments. However, this general approach is often lacking in scientific data processing environments, many of which either remain in, or have grown out of, academia. Still, if you are in such a working environment, there are readily available sources of information and advice and lots of tools to help. I won't expand on this further.
I don't think that your suggestion of including the whole source code in an executable is necessary, even if feasible. Indeed, if you get SCM right, then one of the essential tests that you have done so (and continue to do so) is your ability to rebuild 'old' executables on demand. You should also be able to determine which revision of the sources was used in each executable and version. These ought to make including the source code in an executable unnecessary.
The topic of tying result sets to computations is also, as you say, essential. Here are some of the components of the solution that we are building:
We are moving away from the traditional unstructured text file that is characteristic of the output of a lot of scientific programs towards structured files; in our case we're looking at HDF5 and XML, in which both the data of interest and the meta-data are stored (see the sketch after these points). The meta-data includes the identification of the program (and version) which was used to produce the results, the identification of the input data sets, job parameters and a bunch of other stuff.
We looked at using a DBMS to store our results; we'd like to go this way but we don't have the resources to do it this year, probably not next either. But businesses use DBMSs for a variety of reasons, and one of the reasons is their ability to roll-back, to provide an audit trail, that sort of thing.
We're also looking closely at which result sets need to be stored. A nice approach would be only ever to store original data sets captured from our field sensors. Unfortunately some of our computations take 1000s of CPU-hours to produce so it is infeasible to reproduce them ab-initio on demand. However, we will be storing far fewer intermediate data sets in future than we have in the past.
We are also making it much harder (I'd like to think impossible, but I am not sure we are there yet) for users to edit result sets directly. Once someone does that, all the provenance information in the world is wrong and useless.
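As a concrete but hypothetical illustration of the structured-output point above, here is a sketch of writing a result set plus provenance meta-data to HDF5 with h5py; the attribute names are invented, not a standard:
# Hypothetical example: store results together with their provenance in one HDF5 file.
import h5py
import numpy as np

results = np.random.rand(1000)  # stand-in for the computed result set

with h5py.File('run_0042.h5', 'w') as f:
    f.create_dataset('results', data=results)
    f.attrs['program'] = 'our-processing-tool'           # which tool produced this
    f.attrs['program_revision'] = 'r1234'                # e.g. the injected VCS revision
    f.attrs['input_datasets'] = '["raw_2013_07_01.h5"]'  # identifiers of the inputs, as JSON
    f.attrs['job_parameters'] = '{"threshold": 0.5}'     # run parameters, as JSON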
Finally, if you want to read more about the topic, try Googling 'scientific workflow', 'data provenance' and similar topics.
EDIT: It's not clear from what I wrote above, but we have modified our programs so that they contain their own identification (we use Subversion's keyword capabilities for this with an extension or two of our own) and write this into any output that they produce.
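As a hedged sketch of what the Subversion keyword approach can look like inside a Python module (the expanded $Revision$ string below is a placeholder that svn fills in when the svn:keywords property is set on the file):
# Subversion expands $Revision$ on commit when svn:keywords is enabled for this file.
_RAW = '$Revision: 1234 $'  # placeholder value, shown expanded for illustration
REVISION = _RAW.strip('$ ').replace('Revision: ', '')

# Any output the program writes can then carry REVISION as part of its provenance.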
You need to consider Git submodules or Hg subrepos.
The best practice in this scenario is to have a parent repo which will reference:
the sources of the tool
the result set generated from that tool
ideally the C++ compiler (it won't evolve every day)
ideally the Python distribution (it won't evolve every day)
Each of those is a component, that is, an independent repository (Git or Mercurial).
One precise revision of each component will be referenced by a parent repository.
The whole process is representative of a component-based approach and is key to using SCM (here, Software Configuration Management) to its fullest.