

## AUTOMATIC MIXED PRECISION & TENSORRT

# **WHY MIXED PRECISION?**

## **TENSOR CORES BUILT FOR AI AND HPC**

Mixed Precision Accelerator - Delivering Up To 5X Throughput of FP32<sup>1</sup>



## Matching Accuracy for FP32 and Mixed Precision

| Model Script                      | Framework  | Data Set                      | Automatic or<br>Manual<br>Mixed-Precision | FP32<br>Accuracy   | Mixed-Precision<br>Accuracy  | FP32<br>Throughput            | Mixed-Precision<br>Throughput | Speedup       |
|-----------------------------------|------------|-------------------------------|-------------------------------------------|--------------------|------------------------------|-------------------------------|-------------------------------|---------------|
| BERT Q&A                          | TensorFlow | SQuAD                         | AMP                                       | 90.83<br>Top 1     | 90.99<br>Top 1               | 66.65<br>sentences/sec        | 129.16<br>sentences/sec       | 1.94          |
| SSD w/RN50                        | TensorFlow | COCO 2017                     | AMP                                       | 0.268<br>mAP       | 0.269<br>mAP                 | 569<br>images/sec             | 752<br>images/sec             | 1.32          |
| GNMT<br>(3)                       | PyTorch    | WMT16<br>English to<br>German | Manual                                    | 24.16<br>BLEU      | 24.22<br>BLEU                | 314,831<br>tokens/sec         | 738,521<br>tokens/sec         | 2.35          |
| Neural<br>Collaborative<br>Filter | PyTorch    | MovieLens<br>20M              | Manual                                    | 0.959<br>HR        | 0.960<br>HR                  | 55,004,590<br>samples/sec     | 99,332,230<br>items/sec       | 1.81          |
| U-Net<br>Industrial<br>(1)        | TensorFlow | DAGM 2007                     | AMP                                       | 0.965-0.988        | 0.960-0.988                  | 445<br>images/sec             | 491<br>images/sec             | 1.10          |
| ResNet-50 v1.5                    | MXNet      | ImageNet                      | Manual                                    | 76.67<br>Top 1%    | 76.49<br>Top 1%              | 2,957<br>images/sec           | 10,263<br>images/sec          | 3.47          |
| Tacotron 2 /<br>WaveGlow 1.0      | PyTorch    | LJ Speech<br>Dataset          | AMP                                       | 0.3629/<br>-6.1087 | 0.3645/<br>-6.0258           | 10,843 tok/s<br>257,687 smp/s | 12,742 tok/s<br>500,375 smp/s | 1.18/<br>1.94 |

Values are measured with model running on (1) DGX-1V 8GPU 16G, (2) DGX-1V 8GPU 32G or (3) DGX-2V 16GPU 32G

# ENABLING AUTOMATIC MIXED PRECISION

#### Add Just A Few Lines of Code, Get Up To 3X Speedup

| TensorFlow | <pre>NVIDIA Container 19.07+, TF 1.14+ and TF 2+, explicit optimizer wrapper available:<br/>opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)<br/>os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'<br/>OR<br/>export TF_ENABLE_AUTO_MIXED_PRECISION=1</pre> |  |  |
|------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| PyTorch    | <pre>APEX<br/>model, optimizer = amp.initialize(model, optimizer, opt_level="O1")<br/>with amp.scale_loss(loss, optimizer) as scaled_loss:<br/>    scaled_loss.backward()</pre> |  |  |
| MXNet      | <pre>amp.init()<br/>amp.init_trainer(trainer)<br/>with amp.scale_loss(loss, trainer) as scaled_loss:<br/>    autograd.backward(scaled_loss)</pre> |  |  |

More details: <u>https://developer.nvidia.com/automatic-mixed-precision</u>
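The `amp.scale_loss` calls above exist because small gradient values can underflow to zero in FP16. The following is a minimal sketch of that effect using only the Python standard library; the gradient value and the loss scale of 1024 are hypothetical numbers chosen for illustration, and the helper `to_fp16` is not part of any AMP library.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Hypothetical values, chosen only to illustrate underflow.
tiny_grad = 1e-8      # smaller than the smallest positive FP16 value
loss_scale = 1024.0   # illustrative static loss scale

unscaled = to_fp16(tiny_grad)             # underflows to 0.0 in FP16
scaled = to_fp16(tiny_grad * loss_scale)  # representable after scaling
recovered = scaled / loss_scale           # unscale in higher precision

print("stored without scaling:", unscaled)
print("recovered after scaling:", recovered)
```

Scaling the loss scales every gradient by the same factor, so dividing by that factor before the optimizer step recovers the original magnitudes to within FP16 rounding error.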


## **MIXED PRECISION**

What is the benefit?

Using mixed precision on Volta, your networks can:

1. Run 3-4x faster
2. Consume less memory and put less pressure on memory bandwidth
3. Remain just as powerful

with no architecture change.
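As a rough illustration of the memory point, the sketch below compares the storage needed for one tensor in FP32 and in FP16. The activation shape is a hypothetical example, not a figure from the benchmarks above, and `tensor_bytes` is a helper defined here for illustration.

```python
import struct
from math import prod


def tensor_bytes(shape, fmt: str) -> int:
    """Bytes needed to store a dense tensor with the given element format."""
    return prod(shape) * struct.calcsize(fmt)


# Hypothetical activation tensor: batch x channels x height x width.
activation_shape = (256, 64, 56, 56)

fp32_bytes = tensor_bytes(activation_shape, "f")  # 4 bytes per element
fp16_bytes = tensor_bytes(activation_shape, "e")  # 2 bytes per element

print(f"FP32: {fp32_bytes / 2**20:.0f} MiB, FP16: {fp16_bytes / 2**20:.0f} MiB")
```

Each FP16 tensor takes half the space of its FP32 counterpart. Mixed precision training typically keeps an FP32 copy of the weights, so the saving for a whole training run is usually smaller than half.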


## A MIXED PRECISION SOLUTION




## **MIXED PRECISION TRAINING**



# WHY TENSORRT?

## AI INFERENCE NEEDS TO RUN EVERYWHERE



## NVIDIA TensorRT

From Every Framework, Optimized For Each Target Platform



# CHALLENGES WITH CURRENT APPROACHES

| Requirement                    | Challenges |
|--------------------------------|------------|
| High Throughput                | Unable to process high-volume, high-velocity data<br>Impact: Increased cost (\$, time) per inference |
| Low Response Time              | Applications don't deliver real-time results<br>Impact: Negatively affects user experience (voice recognition, personalized recommendations, real-time object detection) |
| Power and Memory<br>Efficiency | Inefficient applications<br>Impact: Increased cost (running and cooling), makes deployment infeasible |
| Deployment-Grade<br>Solution   | Research frameworks not designed for production<br>Impact: Framework overhead and dependencies increase time to solution and affect productivity |


## ANNOUNCING TensorRT 7

ASR, NLU & TTS | 1000+ Kernels | FP32, FP16, INT8



## TensorRT INTEGRATED WITH TENSORFLOW

Speed Up TensorFlow Inference With TensorRT Optimizations

## Speed up TensorFlow model inference with TensorRT using new TensorFlow APIs

Simple API to use TensorRT within TensorFlow easily

Sub-graph optimization with fallback offers flexibility of TensorFlow and optimizations of TensorRT

Optimizations for FP32, FP16 and INT8, with automatic use of Tensor Cores

#### TensorFlow-TensorRT Inference Workflow



```
# Import the TF-TRT converter
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Set precision
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.INT8)

# Convert to TF-TRT graph
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=input_saved_model_dir,
    conversion_params=conversion_params)

# INT8 calibration (my_calibration_fn yields representative input batches)
converter.convert(calibration_input_fn=my_calibration_fn)

# Save the converted model, then load it to run inference
converter.save(output_saved_model_dir)
```
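The workflow above passes `my_calibration_fn` without defining it. TF-TRT expects a zero-argument generator function that yields the model inputs as a tuple or list. The sketch below shows one way to build such a function; `make_calibration_fn` is a hypothetical helper, the single-input model is an assumption, and the plain Python lists stand in for tensors of representative data.

```python
from typing import Callable, Iterable, Iterator, Tuple


def make_calibration_fn(batches: Iterable, num_steps: int) -> Callable[[], Iterator[Tuple]]:
    """Build a zero-argument generator function that yields calibration inputs."""

    def calibration_fn():
        for step, batch in enumerate(batches):
            if step >= num_steps:
                break
            # One-element tuple: assumes the model takes a single input.
            yield (batch,)

    return calibration_fn


# Stand-in data; in practice each batch would be a tensor of representative inputs.
fake_batches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
my_calibration_fn = make_calibration_fn(fake_batches, num_steps=2)

calibration_batches = list(my_calibration_fn())
print(len(calibration_batches), "calibration batches")
```

Capping the number of steps keeps calibration time bounded; the batches should still cover the range of inputs expected in production.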

#### Available in TensorFlow 2.0 and 1.15

https://github.com/tensorflow/tensorflow

developer.nvidia.com/tensorrt

## TensorRT ONNX PARSER

High-Performance Inference for ONNX Models

Optimize and deploy models from ONNX-supported frameworks to production

Apply TensorRT optimizations to any ONNX framework (Caffe2, Microsoft Cognitive Toolkit, MXNet & PyTorch)

Import TensorFlow and Keras through converters (tf2onnx, keras2onnx)

Use with C++ and Python apps

20+ New Ops in TensorRT 7

Support for Opset 11 (See List of Supported Ops)




## **TensorRT** Optimization

### Deploy highly-optimized Conversational AI apps in production environments

New API to define loops found in RNNs

Compiler fuses pointwise ops, generates optimized kernels, and fuses ops across time steps

Run ASR, NLU and TTS within 300 ms, a requirement for real-time apps, at 10x the performance of CPU

Models Supported: BERT, MT-DNN, RoBERTa, Tacotron 2, WaveRNN, DeepASR, GNMT, LSTM Peephole, LSTM Autoencoder



### **TENSORRT PERFORMANCE**



Batch size 2, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.60GHz, 3.5GHz Turbo (Broadwell), HT On. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK. Score doubled to comprehend Intel's stated claim of 2x performance improvement on Skylake with AVX512.



## **TENSORRT OPTIMIZATIONS**





- Layer & Tensor Fusion
- Weights & Activation Precision Calibration
- Kernel Auto-Tuning
- Dynamic Tensor Memory

These optimizations are completely automatic and are performed with a single function call.

<pre>
engine = trt.utils.uff_to_trt_engine(G_LOGGER,
                                     uff_model,
                                     parser,
                                     INFERENCE_BATCH_SIZE,
                                     1 << 20,
                                     trt.infer.DataType.FLOAT)
</pre>






#### **Un-Optimized Network**



#### **TensorRT Optimized Network**








- Vertical Fusion
- Horizontal Fusion
- Layer Elimination

| Network         | Layers<br>before | Layers<br>after |
|-----------------|------------------|-----------------|
| VGG19           | 43               | 27              |
| Inception<br>V3 | 309              | 113             |
| ResNet-152      | 670              | 159             |
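The layer counts above drop because adjacent operations are merged into single kernels. The toy pass below illustrates vertical fusion only, collapsing each convolution, bias, and ReLU sequence into one fused "CBR" layer. It is a sketch on a list of layer names, not TensorRT's implementation, and the example network is hypothetical.

```python
from typing import List

# Pattern of consecutive layers that the toy pass merges into one kernel.
FUSIBLE = ("conv", "bias", "relu")


def fuse_vertical(layers: List[str]) -> List[str]:
    """Replace every conv -> bias -> relu run with a single fused 'CBR' layer."""
    fused = []
    i = 0
    while i < len(layers):
        if tuple(layers[i:i + 3]) == FUSIBLE:
            fused.append("CBR")
            i += 3
        else:
            fused.append(layers[i])
            i += 1
    return fused


# Hypothetical network expressed as a flat list of layer types.
network = ["conv", "bias", "relu", "conv", "bias", "relu", "pool", "fc"]
optimized = fuse_vertical(network)

print(len(network), "layers ->", len(optimized), "layers")  # 8 layers -> 4 layers
```

Fewer layers means fewer kernel launches and fewer round trips to memory for intermediate tensors.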

#### TensorRT Optimized Network





# FP16, INT8 PRECISION CALIBRATION



#### Precision calibration for INT8 inference:

- Minimizes information loss between FP32 and INT8 inference on a calibration dataset
- Completely automatic
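To make the idea of calibration concrete, the sketch below picks a clipping threshold for symmetric INT8 quantization by trying several candidates on a calibration set and keeping the one with the lowest error. Mean squared error is swapped in here as a simple stand-in for the information-loss measure; it is not TensorRT's calibration algorithm, and the synthetic calibration data and all function names are assumptions.

```python
import random


def quantize_dequantize(x: float, threshold: float) -> float:
    """Map x to a signed 8-bit integer in [-127, 127] and back to float."""
    scale = threshold / 127.0
    q = max(-127, min(127, round(x / scale)))
    return q * scale


def mean_squared_error(activations, threshold: float) -> float:
    """Average squared difference between FP32 values and their INT8 round trip."""
    return sum((a - quantize_dequantize(a, threshold)) ** 2
               for a in activations) / len(activations)


def calibrate(activations, num_candidates: int = 64):
    """Return the candidate threshold with the lowest error, and that error."""
    max_abs = max(abs(a) for a in activations)
    best_threshold, best_error = max_abs, float("inf")
    for i in range(1, num_candidates + 1):
        threshold = max_abs * i / num_candidates
        error = mean_squared_error(activations, threshold)
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error


# Synthetic stand-in for activations collected on a calibration dataset.
random.seed(0)
calibration_set = [random.gauss(0.0, 1.0) for _ in range(2000)]

threshold, error = calibrate(calibration_set)
max_abs = max(abs(a) for a in calibration_set)
baseline_error = mean_squared_error(calibration_set, max_abs)
print(f"threshold={threshold:.3f} error={error:.2e} (no clipping: {baseline_error:.2e})")
```

The search can never do worse than the no-clipping baseline, because the largest candidate is the maximum absolute value itself.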



#### Reduced Precision Inference Performance (ResNet50)






# FP16, INT8 PRECISION CALIBRATION

|            | FP32<br>Top 1  | INT8<br>Top 1 | Difference |
|------------|----------------|---------------|------------|
| GoogLeNet  | 68.87%         | 68.49%        | 0.38%      |
| VGG        | 68.56%         | 68.45%        | 0.11%      |
| ResNet-50  | 73.11%         | 72.54%        | 0.57%      |
| ResNet-152 | 75.18%         | 74.56%        | 0.61%      |


## **KERNEL AUTO-TUNING & DYNAMIC TENSOR MEMORY**



**Kernel Auto-Tuning**

- 100s of custom-built kernels, tuned for the target GPU architecture (Tesla V100, Jetson TX2, Drive PX2)
- Multiple parameters:
  - Batch size
  - Input dimensions
  - Filter dimensions
  - ...
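The idea behind auto-tuning can be sketched as timing several candidate implementations of the same operation on a sample input and keeping the fastest for each configuration. The example below is a CPU-only toy under that assumption: the two candidate functions and the input sizes are hypothetical, and TensorRT's actual kernel selection is not shown in this deck.

```python
import time
from typing import Callable, Dict, Sequence


def sum_squares_loop(values: Sequence[int]) -> int:
    """Candidate 1: explicit accumulation loop."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_builtin(values: Sequence[int]) -> int:
    """Candidate 2: generator expression passed to the built-in sum."""
    return sum(v * v for v in values)


def autotune(candidates: Dict[str, Callable], sample: Sequence[int], repeats: int = 5) -> str:
    """Time each candidate on the sample input and return the fastest one's name."""
    best_name, best_time = "", float("inf")
    for name, kernel in candidates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            kernel(sample)
        elapsed = (time.perf_counter() - start) / repeats
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name


candidates = {"loop": sum_squares_loop, "builtin": sum_squares_builtin}

# Tune once per input size, as a stand-in for tuning per batch size and dimensions.
tuned = {size: autotune(candidates, list(range(size))) for size in (10, 10_000)}
print(tuned)
```

The winner can differ between machines and input sizes, which is why tuning is done for the specific target platform and parameters.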





**Dynamic Tensor Memory** 

- Reduces memory footprint and improves memory re-use
- Manages memory allocation for each tensor only for the duration of its usage
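The saving from allocating each tensor only while it is in use can be estimated from tensor lifetimes. The sketch below compares the memory needed when every tensor is kept for the whole run with the peak when tensors are freed after their last use. The tensor sizes and lifetimes are hypothetical, and the `Tensor` type is defined here for illustration.

```python
from typing import List, NamedTuple


class Tensor(NamedTuple):
    name: str
    size_bytes: int
    first_use: int  # index of the layer that produces the tensor
    last_use: int   # index of the last layer that reads it


def peak_memory(tensors: List[Tensor]) -> int:
    """Peak bytes when each tensor is allocated only while it is live."""
    num_steps = max(t.last_use for t in tensors) + 1
    return max(
        sum(t.size_bytes for t in tensors if t.first_use <= step <= t.last_use)
        for step in range(num_steps)
    )


# Hypothetical chain of activations, each consumed by the next layer.
tensors = [
    Tensor("a", 400, 0, 1),
    Tensor("b", 300, 1, 2),
    Tensor("c", 200, 2, 3),
    Tensor("d", 100, 3, 4),
]

static_total = sum(t.size_bytes for t in tensors)
dynamic_peak = peak_memory(tensors)
print(static_total, "bytes static vs", dynamic_peak, "bytes peak")  # 1000 vs 700
```

The peak occurs at step 1, where tensors `a` and `b` are both live; memory released by `a` afterwards can be reused for later tensors.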


# WHY TENSORRT INFERENCE SERVER?

## **INEFFICIENCY LIMITS INNOVATION**

#### **Difficulties with Deploying Data Center Inference**





#### **Custom Development**



Developers need to reinvent the plumbing for every application

## **NVIDIA TENSORRT INFERENCE SERVER**

#### **Production Data Center Inference Server**



Maximize real-time inference performance of GPUs

Quickly deploy and manage multiple models per GPU per node

Easily scale to heterogeneous GPUs and multi GPU nodes

Integrates with orchestration systems and auto scalers via latency and health metrics

Now open source for thorough customization and integration


## **INFERENCE SERVER ARCHITECTURE**

#### Available with Monthly Updates



#### Models supported

- TensorFlow GraphDef/SavedModel
- TensorFlow and TensorRT GraphDef
- TensorRT Plans
- Caffe2 NetDef (ONNX import)
- ONNX graph
- PyTorch JIT (.pt)

Multi-GPU support

Concurrent model execution

Server HTTP REST API/gRPC

Python/C++ client libraries


## **TENSORRT INFERENCE SERVER OVERVIEW**

A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:

1. Client serializes the inference request into a message and sends it to the server (Client Send)
2. Message travels over the network from the client to the server (Network)
3. Message arrives at server, and is deserialized (Server Receive)
4. Request is placed on the queue (Server Queue)
5. Request is removed from the queue and computed (Server Compute)
6. Completed request is serialized in a message and sent back to the client (Server Send)
7. Completed message travels over network from the server to the client (Network)
8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
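The eight steps can be walked through in a single process to see where time is spent in a request. The sketch below is a simulation only: it uses JSON for serialization, an in-memory queue, and a no-op for the network hops, none of which reflect the server's actual wire format. The stage names and the `run_request` helper are assumptions made for illustration.

```python
import json
import queue
import time

STAGES = [
    "client_send", "network_to_server", "server_receive", "server_queue",
    "server_compute", "server_send", "network_to_client", "client_receive",
]


def run_request(inputs, model):
    """Run one request through the eight stages and record per-stage seconds."""
    timings = {}
    pending = queue.Queue()

    def timed(stage, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings[stage] = time.perf_counter() - start
        return result

    message = timed("client_send", json.dumps, {"inputs": inputs})             # step 1
    message = timed("network_to_server", lambda m: m, message)                 # step 2, no real network
    request = timed("server_receive", json.loads, message)                     # step 3
    timed("server_queue", pending.put, request)                                # step 4
    outputs = timed("server_compute", lambda: model(pending.get()["inputs"]))  # step 5
    reply = timed("server_send", json.dumps, {"outputs": outputs})             # step 6
    reply = timed("network_to_client", lambda m: m, reply)                     # step 7
    result = timed("client_receive", json.loads, reply)["outputs"]             # step 8
    return result, timings


def double(xs):
    """Stand-in model: doubles every input value."""
    return [2 * x for x in xs]


result, timings = run_request([1, 2, 3], double)
total = sum(timings.values())
for stage in STAGES:
    print(f"{stage:>18}: {timings[stage] * 1e6:8.1f} us")
```

In a real deployment the network and queue stages often dominate, so measuring each stage separately shows whether to optimize the model, the batching, or the transport.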

## TENSORRT INFERENCE SERVER METRICS FOR AUTOSCALING

Before TensorRT Inference Server - 800 FPS

- One model per GPU
- Requests are steady across all models
- Utilization is low on all GPUs

Before TensorRT Inference Server - 5,000 FPS



- Spike in requests for blue model
- GPUs running blue model are being fully utilized
- Other GPUs remain underutilized

## TENSORRT INFERENCE SERVER METRICS FOR AUTOSCALING

After TensorRT Inference Server - 5,000 FPS

- Load multiple models on every GPU
- Load is evenly distributed between all GPUs

After TensorRT Inference Server - 15,000 FPS



- Spike in requests for blue model
- Each GPU can run the blue model concurrently
- Metrics to indicate time to scale up
  - GPU utilization
  - Power usage
  - Inference count
  - Queue time
  - Number of requests/sec
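A scale-up rule built on these metrics might look like the sketch below. The thresholds, the `GpuMetrics` type, and the decision rule are all hypothetical; a production autoscaler would read the values from the server's metrics endpoint and tune the limits for its workload.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuMetrics:
    gpu_utilization: float    # fraction in [0, 1]
    power_usage_watts: float
    inference_count: int
    queue_time_ms: float
    requests_per_sec: float


def should_scale_up(samples: List[GpuMetrics],
                    max_utilization: float = 0.9,
                    max_queue_time_ms: float = 50.0) -> bool:
    """Scale up when every GPU is saturated or requests wait too long on average."""
    all_busy = all(s.gpu_utilization >= max_utilization for s in samples)
    avg_queue_ms = sum(s.queue_time_ms for s in samples) / len(samples)
    return all_busy or avg_queue_ms > max_queue_time_ms


# Hypothetical snapshots of a two-GPU node.
saturated = [
    GpuMetrics(0.95, 250.0, 1000, 10.0, 500.0),
    GpuMetrics(0.97, 260.0, 1200, 12.0, 520.0),
]
uneven = [
    GpuMetrics(0.30, 120.0, 100, 1.0, 50.0),
    GpuMetrics(0.95, 250.0, 1000, 10.0, 500.0),
]

print(should_scale_up(saturated), should_scale_up(uneven))  # True False
```

Requiring all GPUs to be busy avoids scaling up when load is merely unbalanced, which is the case that concurrent model execution is meant to absorb.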
