TensorFlow For Dummies
If you succeeded in launching jobs locally, deploying your applications to the cloud shouldn't present any difficulty. But be mindful of two issues:
  • You need to upload training/evaluation data to Cloud Storage.
  • The ML Engine may not support the versions of the packages you need.
Before you execute either of the applications in the ch13 directory of the TensorFlow For Dummies downloadable code, you’ll need to upload the mnist_test.tfrecords and mnist_train.tfrecords files to a Cloud Storage bucket. For example, if your project's ID is $(PROJECT_ID), you can create a bucket named $(PROJECT_ID)_mnist in the central United States with the following command:
gsutil mb -c regional -l us-central1 gs://$(PROJECT_ID)_mnist
After you create the bucket, you can upload the two MNIST files to the bucket with the following command:
gsutil cp mnist_test.tfrecords mnist_train.tfrecords gs://$(PROJECT_ID)_mnist
After the command executes, it's a good idea to check that Cloud Storage created objects for the two files. You can verify this with the following command:
gsutil ls gs://$(PROJECT_ID)_mnist

Running a remote training job

After you upload your training and test data, you can launch a training job with the following command:
gcloud ml-engine jobs submit training $(JOB_ID)
$(JOB_ID) provides a unique identifier for the training job. After you launch the job, you can use this ID to check on the job's status.

In addition to identifying the job, you need to tell the ML Engine where to find your package and your input data. You also need to tell the engine where it should store output files. You provide this information by following the command with flags, and the following list describes each of them.

Flags for Cloud Training Jobs

  • --module-name=MODULE_NAME: Identifies the module to execute
  • --package-path=PACKAGE_PATH: Path to the Python package containing the module to execute
  • --job-dir=JOB_DIR: Path to store output files
  • --staging-bucket=STAGING_BUCKET: Bucket to hold the package during operation
  • --region=REGION: The region of the machine learning job
  • --runtime-version=RUNTIME_VERSION: The version of the ML Engine for the job
  • --stream-logs: Block until the job completes and stream the logs
  • --scale-tier=SCALE_TIER: The job's operating environment
  • --config=CONFIG: Path to a job configuration file

The --module-name, --package-path, and --job-dir flags serve the same purposes as the similarly named flags for local training jobs. The --staging-bucket flag identifies the bucket that holds the deployed package. The --region flag accepts one of the regions in which the ML Engine is available, such as us-central1.

By default, deployed applications run on the latest stable version of the ML Engine. You can select a different version by setting the --runtime-version flag. You can find the list of available versions at cloud.google.com/ml-engine/docs/runtime-version-list.

It's a good idea to set the --stream-logs flag, which forces the command to block until the job completes. As the job runs, the console prints messages from the remote log. Aborting the command (Ctrl-C) doesn't affect the remote job.
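
Putting these flags together, a complete submission might look something like the following sketch. The job ID, module name, package path, and runtime version are illustrative placeholders rather than values taken from the book's code:
gcloud ml-engine jobs submit training mnist_training_001 \
    --module-name=trainer.task \
    --package-path=./trainer \
    --job-dir=gs://$(PROJECT_ID)_mnist/output \
    --staging-bucket=gs://$(PROJECT_ID)_mnist \
    --region=us-central1 \
    --runtime-version=1.6 \
    --stream-logs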

By default, applications uploaded to the ML Engine run only on a single CPU. You can configure the execution environment by setting the --scale-tier flag to one of the values in the following list.

Scale Tier Values

  • basic: A single worker on a CPU
  • basic-gpu: A single worker with a GPU
  • basic-tpu: A single worker instance with a Cloud TPU
  • standard-1: Many workers and a few parameter servers
  • premium-1: A large number of workers and many parameter servers
  • custom: Define a cluster

If you set --scale-tier to basic-gpu, you can execute your code on an Nvidia Tesla K80 GPU. This has 4,992 CUDA cores and 24 GB of GDDR5 memory. If you set --scale-tier to basic-tpu, you can execute your code on one or more of Google's Tensor Processing Units (TPUs). At the time of this writing, Google restricts TPU access to developers in its Cloud TPU program.

If you set --scale-tier to standard-1 or premium-1, you can run your job on a cluster of processors. If you set --scale-tier to custom, you can configure the cluster by assigning the --config flag to the name of a configuration file.
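
The configuration file for a custom cluster is a YAML file with a trainingInput section. As a rough sketch, assuming the machine types and counts shown here (which are illustrative, not values from the book's code), such a file might look like this:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 4
  parameterServerType: standard
  parameterServerCount: 2
You would then pass the file to the submission command with --config=config.yaml.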

Running a remote prediction job

If you upload a SavedModel to a Cloud Storage bucket, you can launch a prediction job with the following command:
gcloud ml-engine jobs submit prediction $(JOB_ID)
This command accepts flags that specify where the prediction job should read its input and write its output. The following list describes each of these flags.

Flags for Cloud Prediction Jobs

  • --model-dir=MODEL_DIR: Path of the bucket containing the saved model
  • --model=MODEL: Name of the model to use for prediction
  • --input-paths=INPUT_PATH,[INPUT_PATH,…]: Path to the input data to use for prediction
  • --data-format=DATA_FORMAT: Format of the input data
  • --output-path=OUTPUT_PATH: Path to store the prediction results
  • --region=REGION: The region of the machine learning job
  • --batch-size=BATCH_SIZE: Number of records per batch
  • --max-worker-count=MAX_WORKER_COUNT: The maximum number of workers to employ for parallel processing
  • --runtime-version=RUNTIME_VERSION: The version of the ML Engine for the job
  • --version=VERSION: Version of the model to be used

When you launch a remote prediction job, you must identify the model's name with --model or the bucket containing the model files with --model-dir. You also need to identify the location of the input files with --input-paths.

The ML Engine accepts prediction input data in one of three formats. You can identify the format of your data by setting --data-format to one of the following values:

  • text: Text files with one line per instance
  • tf-record: TFRecord files
  • tf-record-gzip: GZIP-compressed TFRecord files
The last required flag is --output-path. This tells the ML Engine which Cloud Storage bucket should contain the prediction results.
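
As a rough sketch, a prediction job that reads the TFRecord test data uploaded earlier might be launched as follows. The job ID and the bucket paths for the model and the results are placeholders:
gcloud ml-engine jobs submit prediction mnist_prediction_001 \
    --model-dir=gs://$(PROJECT_ID)_mnist/export \
    --input-paths=gs://$(PROJECT_ID)_mnist/mnist_test.tfrecords \
    --data-format=tf-record \
    --output-path=gs://$(PROJECT_ID)_mnist/predictions \
    --region=us-central1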

Viewing a job's status

After you launch a job, you can view the job’s status in two ways. First, you can use gcloud commands, such as the following:
  • gcloud ml-engine jobs list: List the jobs associated with the default project along with their statuses and creation times
  • gcloud ml-engine jobs describe $(JOB_ID) --summarize: Provide detailed information about a specific job in human-readable format
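For example, to check on a job submitted with the placeholder ID mnist_training_001 used in the earlier sketch, you could run:
gcloud ml-engine jobs describe mnist_training_001 --summarize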
Second, you can check on a job by visiting the Google Cloud Console. If you click the menu bars in the upper left and scroll down, you see an entry titled ML Engine. This entry leads to two options: Jobs and Models.

If you click the ML Engine → Jobs option, the page lists all the jobs associated with the project. If you click on a job name, a new page provides detailed information about the job's execution, including its status and any log messages.
