Google Cloud ML and GCS Bucket issues - google-cloud-storage

I'm using open-source TensorFlow implementations of research papers, for example DCGAN-tensorflow. Most of the libraries I'm using are configured to train the model locally, but I want to use Google Cloud ML to train the model since I don't have a GPU on my laptop. I'm finding it difficult to change the code to support GCS buckets. At the moment, I'm saving my logs and models to /tmp and then running a 'gsutil' command to copy the directory to gs://my-bucket at the end of training (example here). If I try saving the model directly to gs://my-bucket, it never shows up.
As for training data, one of the TensorFlow samples copies data from GCS to /tmp for training (example here), but this only works when the dataset is small. I want to use CelebA, which is too large to copy to /tmp on every run. Is there any documentation or guide on how to go about updating code that trains locally to use Google Cloud ML?
The implementations run various versions of TensorFlow, mainly 0.11 and 0.12.

There is currently no definitive guide. The basic idea would be to replace all occurrences of native Python file operations with equivalents in the file_io module, most notably:
open() -> file_io.FileIO()
os.path.exists() -> file_io.file_exists()
glob.glob() -> file_io.get_matching_files()
These functions will work locally and on GCS (as well as any registered file system). Note, however, that there are some slight differences between file_io and the standard file operations (e.g., a different set of 'modes' is supported).
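For example, a minimal sketch of what the substitution looks like (the bucket path is hypothetical):
from tensorflow.python.lib.io import file_io

# The same calls work with local paths and gs:// paths alike.
f = file_io.FileIO('gs://my-bucket/params.json', mode='r')
config = f.read()
f.close()

if not file_io.file_exists('gs://my-bucket/checkpoints'):
    file_io.recursive_create_dir('gs://my-bucket/checkpoints')

checkpoints = file_io.get_matching_files('gs://my-bucket/checkpoints/model*')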
Fortunately, checkpoint and summary writing work out of the box; just be sure to pass a GCS path to tf.train.Saver.save and tf.summary.FileWriter.
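For instance, something along these lines (the bucket is hypothetical; this uses the TF 0.12-style names, while 0.11 has tf.train.SummaryWriter and tf.initialize_all_variables instead):
import tensorflow as tf

x = tf.Variable(0, name='x')  # stand-in for a real model

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    # Checkpoints and summaries can be written straight to the bucket.
    saver.save(sess, 'gs://my-bucket/checkpoints/model.ckpt', global_step=0)
    writer = tf.summary.FileWriter('gs://my-bucket/logs', graph=sess.graph)
    writer.close()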
In the sample you sent, that looks potentially painful. Consider monkey-patching the Python functions to map to the TensorFlow equivalents when the program starts, so you only have to do it once (demonstrated here).
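A minimal sketch of that monkey patch (note that file_io.FileIO requires an explicit mode argument, so callers that rely on open()'s default mode may need a small wrapper):
import glob
import os
import builtins  # on Python 2, use the __builtin__ module instead
from tensorflow.python.lib.io import file_io

# Swap the standard functions for their GCS-aware equivalents once, at startup,
# so the rest of the (unmodified) training code picks them up automatically.
builtins.open = file_io.FileIO
os.path.exists = file_io.file_exists
glob.glob = file_io.get_matching_files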
As a side note, all of the samples on this page show reading files from GCS.

Related

AWS SageMaker - Realtime Data Processing

My company does online consumer behavior analysis, and we make real-time predictions using the data we collect from various websites (with our JavaScript embedded).
We have been using AWS ML for real-time prediction, but now that we are experimenting with AWS SageMaker, it looks like real-time data processing is a problem compared to AWS ML. For example, we have some string variables that AWS ML can automatically convert to numerics and use for real-time prediction, but it does not look like SageMaker can do that.
Does anyone have any experience with real-time data processing and prediction in AWS SageMaker?
It sounds like you're only familiar with the training component of SageMaker. SageMaker has several different components:
Jupyter Notebooks
Labeling
Training
Inference
You're most likely dealing with #3 and #4. There are a few ways to work with SageMaker here. You can use one of the built-in algorithms, which provide both training and inference containers that can be launched on SageMaker. To use these, you can work entirely from the console and just point at your data in S3, similar to AWS ML. If you're not using the built-in algos, you can use the sagemaker-python-sdk to create both training and prediction containers if you're using a common framework like TensorFlow, MXNet, or PyTorch. Finally, if you're using a highly custom algorithm (which you aren't, if you're porting from AWS ML), you can bring your own Docker container for training and for inference.
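As a rough sketch of the sagemaker-python-sdk route for a TensorFlow script (the script name, bucket, and instance types are hypothetical, and the parameter names follow recent SDK versions, so check them against the version you have):
import sagemaker
from sagemaker.tensorflow import TensorFlow

# Works inside a SageMaker notebook; elsewhere, pass an IAM role ARN instead.
role = sagemaker.get_execution_role()

# Trains your script in a managed TensorFlow container...
estimator = TensorFlow(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="2.4.1",
    py_version="py37",
)
estimator.fit("s3://my-bucket/training-data")

# ...and then deploys the trained model behind a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")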
To create an inference endpoint, you can go to the console under the inference section and click through to build your endpoint. See the GIF here for an example.
Beyond that, if you want to invoke the endpoint in real time from code, you can use any of the AWS SDKs; I'll demonstrate with the Python SDK, boto3, here:
import boto3

# Create a client for the SageMaker runtime API and invoke an existing endpoint.
sagemaker = boto3.client("runtime.sagemaker")
response = sagemaker.invoke_endpoint(EndpointName="herpderp", Body="some content")
In this code, if you needed to convert the incoming string values to numerical values, you could easily do that before building the request body.
Yes, it can! You have to create a Pipeline (preprocess + model + postprocess) and deploy it as an endpoint for real-time inference. You can check the inference pipeline examples on the SageMaker GitHub site; they use the sagemaker-python-sdk to train and deploy.
1: This example is for a small-data scikit-learn model:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_inference_pipeline
2: It also supports big data (a Spark ML Pipeline serving container); you can find that example in the official GitHub repo as well.
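As a rough sketch of the idea in the sagemaker-python-sdk (the model objects are assumed to have been built already, e.g. a scikit-learn preprocessing model followed by the actual predictor; names are hypothetical):
from sagemaker import get_execution_role
from sagemaker.pipeline import PipelineModel

role = get_execution_role()

# preprocessor_model and predictor_model are assumed to exist already,
# e.g. the model objects produced by earlier training jobs.
pipeline = PipelineModel(
    name="my-inference-pipeline",
    role=role,
    models=[preprocessor_model, predictor_model],
)

# The deployed endpoint runs the containers in order, so raw input is
# preprocessed before it ever reaches the predictor.
predictor = pipeline.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge",
)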
AWS SageMaker is a robust machine learning service in AWS that manages every major aspect of machine learning implementation, including data preparation, model construction, training and fine-tuning, and deployment.
Preparation
SageMaker provides a range of resources that make it simple to prepare data for machine learning models, even when the data comes from many sources or arrives in a variety of formats.
With SageMaker Ground Truth, it's simple to label data, including video, images, and text, and have it automatically processed into usable form. Ground Truth processes and merges this data using auto-segmentation and a suite of tools to create consistent data labels that can be used in machine learning models. AWS, in conjunction with SageMaker Data Wrangler and SageMaker Processing, reduces a data preparation phase that might take weeks or months to a matter of days, if not hours.
Build
SageMaker Studio Notebooks centralize everything relevant to your machine learning models, allowing them to be conveniently shared along with their associated data. You can choose from a variety of built-in, open-source algorithms to start processing your data with SageMaker JumpStart, or you can build custom parameters for your machine learning model.
Once you've chosen a model, SageMaker starts processing data automatically and offers a simple, easy-to-understand interface for tracking your model's progress and performance.
Training
SageMaker provides a range of tools for training your model from the data you've prepared, including a built-in debugger for detecting possible errors.
The training job's results are saved in an Amazon S3 bucket, where they can be viewed using other AWS services such as Amazon QuickSight.
Deployment
It's pointless to have strong machine learning models if they can't be easily deployed to your hosting infrastructure. Fortunately, SageMaker makes deploying machine learning models to your current services and applications as easy as a single click.
SageMaker allows for real-time data processing and prediction once a model is deployed. This has far-reaching consequences in a variety of areas, including finance and health. Businesses operating in the stock market, for example, can make real-time financial decisions and more attractive acquisitions by pinpointing the best time to buy.
Integration with Amazon Comprehend allows for natural language processing, transforming human language into usable data for training better models, or for providing a chatbot to customers through Amazon Lex.
In conclusion…
Machine Learning is no longer a niche technological curiosity; it now plays a critical role in the decision-making processes of thousands of companies around the world. There has never been a better time to start your Machine Learning journey than now, with virtually unlimited frameworks and simple integration into the AWS system.
In this case, you will need to preprocess your data before feeding it into the InvokeEndpoint request body. If you use Python, you can use int('your_integer_string') or float('your_float_string') to convert a string to an integer or float. If you use Java, you can use Integer.parseInt("yourIntegerString"), Long.parseLong("yourLongString"), Double.parseDouble("yourDoubleString"), or Float.parseFloat("yourFloatString").
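A small sketch of that kind of client-side preprocessing in Python, assuming a hypothetical endpoint that expects CSV input (the feature names and endpoint name are made up):
import boto3

# Hypothetical raw record with string-encoded numeric fields.
raw = {"age": "42", "income": "55000.50", "clicks": "17"}

# Convert the strings to numerics, then serialize the row as CSV.
features = [int(raw["age"]), float(raw["income"]), int(raw["clicks"])]
body = ",".join(str(v) for v in features)

runtime = boto3.client("runtime.sagemaker")
response = runtime.invoke_endpoint(
    EndpointName="my-endpoint",
    ContentType="text/csv",
    Body=body,
)
print(response["Body"].read())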
Hope this helps!
-Han

Distributed tensorflow source code

I wanted to check the source code of the distributed training feature of tensorflow and its overall structure. Worker-PS relations, etc. However I am lost in tensorflow's repository. Can someone guide me through the repository and point the source code I am looking for?
Unfortunately, not all of TensorFlow's distributed machinery is open source; in particular, the dynamic placement algorithm is not. To quote Aurélien Géron from Hands-On Machine Learning with Scikit-Learn and TensorFlow:
The TensorFlow whitepaper presents a friendly dynamic placer algorithm that auto-magically distributes operations across all available devices, taking into account things like the measured computation time in previous runs of the graph, estimations of the size of the input and output tensors to each operation, the amount of RAM available in each device, communication delay when transferring data in and out of devices, hints and constraints from the user, and more. Unfortunately, this sophisticated algorithm is internal to Google; it was not released in the open source version of TensorFlow.
But here are the main entry points of TF distributed in the public repo:
Cluster in tensorflow/python/grappler/cluster.py
Server and ClusterSpec in tensorflow/python/training/server_lib.py
worker_service.proto in tensorflow/core/protobuf/worker_service.proto
To dive deeper you'll need to go into the native C++ code in the tensorflow/core/distributed_runtime package; e.g., here's the gRPC server implementation.
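As a small illustration of how those pieces surface in the Python API (the addresses and job layout are hypothetical):
import tensorflow as tf

# A toy cluster: one parameter server and two workers on localhost.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process starts a server for its own job/task; under the hood this is
# the gRPC server whose C++ implementation lives in distributed_runtime.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()  # block and serve requests from the other cluster members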

How to automatically edit over 100k files on GCS using Dataflow?

I have over 100 thousand files on Google Cloud Storage containing JSON objects, and I'd like to create a mirror that maintains the filesystem structure but with some fields removed from the content of the files.
I tried to use Apache Beam on Google Cloud Dataflow, but it splits all files and I can't maintain the structure anymore. I'm using TextIO.
The structure I have is something like reports/YYYY/MM/DD/<filename>
But Dataflow outputs to output_dir/records-*-of-*.
How can I make Dataflow not split the files and output them with the same directory and file structure?
Alternatively, is there a better system for making this kind of edit on a large number of files?
You cannot use TextIO directly for this, but Beam 2.2.0 will include a feature that will help you write this pipeline yourself.
If you can build a snapshot of Beam at HEAD, you can already use this feature. Note: the API may change slightly between the time of writing this answer and the release of Beam 2.2.0
Use Match.filepatterns() to create a PCollection<Metadata> of files matching the filepattern
Map the PCollection<Metadata> with a ParDo that does what you want to each file using FileSystems:
Use the FileSystems.open() API to read the input file and then standard Java utilities for working with ReadableByteChannel.
Use FileSystems.create() API to write the output file.
Note that Match is a very simple PTransform (that uses FileSystems under the hood), so another way to use it in your project is to copy-paste (the necessary parts of) its code into your project, or to study its code and reimplement something similar. This can be an option if you're hesitant to update your Beam SDK version.
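For what it's worth, here is a rough sketch of the same idea in the Beam Python SDK (newer releases, where fileio.MatchFiles and the FileSystems API are available; the bucket names and the field to drop are hypothetical):
import json
import apache_beam as beam
from apache_beam.io.fileio import MatchFiles
from apache_beam.io.filesystems import FileSystems

SRC = 'gs://src-bucket/reports/'
DST = 'gs://dst-bucket/reports/'

def rewrite(metadata):
    # Read one input file, drop the unwanted field, and write the result to
    # the mirrored path so the directory and file structure is preserved.
    with FileSystems.open(metadata.path) as f:
        record = json.loads(f.read())
    record.pop('field_to_remove', None)
    out_path = metadata.path.replace(SRC, DST)
    with FileSystems.create(out_path) as f:
        f.write(json.dumps(record).encode('utf-8'))
    return out_path

with beam.Pipeline() as p:
    _ = (p
         | MatchFiles(SRC + '**')
         | beam.Map(rewrite))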

How to make Weka API work with MongoDB?

I'm looking to use WEKA to train and predict from data in MongoDB. Specifically, I intend to use the Weka API to analyse data (e.g. to build a recommendation engine). But I have no idea how to proceed, because the data in MongoDB is stored in the BSON format, while WEKA uses the ARFF format. I would like to use the WEKA API to read data from MongoDB, analyse it, and provide recommendations to the user in real time. I cannot find a bridge between WEKA and MongoDB.
Is this even possible or should I try another approach?
Before I begin, I should say that WEKA isn't the best tool for working with Big Data. If you really have Big Data, you will likely want to use Spark and the Hadoop family as they are more suited to analysis.
To answer your question as written, I would advise doing the training manually (i.e. creating a training file using any programmatic tools available to you) and pretraining a model. These models can then be saved and integrated into a program accordingly.
For testing, you can follow the official instructions, but I usually take a bit of a shortcut: I preprocess my data into a CSV-like format (as if it were going into an ARFF file) and just prepend a valid ARFF header (the same one your training file uses). From there, it is very easy to test the instances. In my experience, this greatly simplifies the process of writing code that actually makes novel predictions.
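As a rough sketch of that preprocessing step in Python (the connection string, collection, and attribute names are all hypothetical), using pymongo to pull documents out of MongoDB and write them into an ARFF file that Weka can load:
import csv
from pymongo import MongoClient

# Hypothetical connection, database, collection, and field names.
client = MongoClient("mongodb://localhost:27017")
documents = client["mydb"]["ratings"].find()

with open("train.arff", "w", newline="") as out:
    # A minimal ARFF header matching the attributes exported below.
    out.write("@relation ratings\n")
    out.write("@attribute user_id numeric\n")
    out.write("@attribute item_id numeric\n")
    out.write("@attribute rating numeric\n")
    out.write("@data\n")
    writer = csv.writer(out)
    for doc in documents:
        writer.writerow([doc["user_id"], doc["item_id"], doc["rating"]])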

How to combine version control with data analysis

I do a lot of solo data analysis, using a combination of tools such as R, Python, PostgreSQL, and whatever I need to get the job done. I use version control software (currently Subversion, though I'm playing around with git on the side) to manage all of my scripts, but the data is perpetually a challenge. My scripts tend to run for a long period of time (hours, or occasionally days) to generate small or large datasets, which I in turn use as input for more scripts.
The challenge I face is in how to "rollback" what I do if I want to check out my scripts from an earlier point in time. Getting the old scripts is easy. Getting the old data would be easy if I put my data into version control, but conventional wisdom seems to be to keep data out of version control because it's so darned big and cumbersome.
My question: how do you combine and/or manage your processed data with a version control system on your code?
Subversion, and perhaps other [d]VCSs as well, supports symbolic links. The idea is to store the raw data 'well organized' on a filesystem, while tracking the relation between 'script' and 'generated data' with symbolic links under version control.
data -> data-1.2.3
All your scripts call a load-data routine that retrieves a given dataset through the versioned symbolic link, so every checkout resolves to the matching dataset.
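A tiny sketch of what the script side can look like (the paths and file names are hypothetical):
import csv
import os

DATA_LINK = "data"  # versioned symlink, e.g. data -> data-1.2.3

def load_data(filename):
    # Reading through the symlink means checking out an older revision of the
    # link automatically points the script at the matching dataset version.
    with open(os.path.join(DATA_LINK, filename), newline="") as f:
        return list(csv.reader(f))

rows = load_data("results.csv")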
Using this approach, code and calculated datasets are tracked within one tool, without bloating your repository with binary data.