# embedded VISION summit

# Is Your AI Data Pre-processing Fast Enough? Speed It Up Using rocAL<sup>™</sup>

Rajy Rawther, PMTS Software Architect, Advanced Micro Devices, Inc.





- Introduction: Why do we need rocAL?
- rocAL pipeline and architecture
- Operators for data loading and augmentations
- Flexible pipelines: scalability across multiple devices
- How to use rocAL?
- Deep dive into MLPerf object detection example with rocAL
- rocAL performance advantages
- rocAL use case in inference
- Conclusion

#### **The Problem**




# Feeding the Beast: How to Fully Utilize GPUs?

- GPU performance more than doubles with each new generation
- Native pipelines mostly use CPU cores to preprocess data before training
- As training gets faster with GPUs, preprocessing needs to catch up
  - EPYC<sup>™</sup> + MI100: 64 CPU cores, 8 GPUs, 8 CPU cores/GPU
  - EPYC<sup>™</sup> + MI200: 64 CPU cores, 8 GPUs (2x perf), 8 CPU cores/GPU, >300 TFLOPs (FP16)
- A falling CPU-to-GPU performance ratio throttles overall training speed

The remedy is a high-performance pre-processing library that can load-balance between CPU and GPU.
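The core-count arithmetic on this slide can be made concrete with a short sketch. It uses only the figures quoted above (64 CPU cores, 8 GPUs, roughly 2x GPU performance from one generation to the next). The function name and the "cores per unit of GPU throughput" metric are illustrative assumptions, not part of rocAL.

```python
def cpu_cores_per_gpu_throughput(cpu_cores: int, gpus: int, relative_gpu_perf: float) -> float:
    """CPU cores available per unit of GPU throughput (hypothetical metric)."""
    cores_per_gpu = cpu_cores / gpus
    return cores_per_gpu / relative_gpu_perf


# Same host CPU budget in both cases; only the GPU generation changes.
mi100 = cpu_cores_per_gpu_throughput(64, 8, relative_gpu_perf=1.0)
mi200 = cpu_cores_per_gpu_throughput(64, 8, relative_gpu_perf=2.0)

print(f"MI100: {mi100} cores per unit of GPU throughput")
print(f"MI200: {mi200} cores per unit of GPU throughput")
```

With the core count fixed, doubling GPU throughput halves the CPU budget available to feed each unit of GPU work, which is why CPU-only preprocessing falls behind.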






### **Key Features of rocAL**


- CPU- and GPU-based implementations of each operator
- Python and C++ APIs for easy integration and testing
- Flexible graphs to create custom pipelines that use CPU cores or the GPU
- Supports many new augmentations like fish-eye, non-linear blend, water, RICAP, etc.
- Support for many workloads
  - Classification
  - Object detection
  - Pose estimation and segmentation
- Seamless interoperability with frameworks using rocAL framework plugins
- Optimized to give maximum performance on AMD EPYC<sup>™</sup> CPUs and AMD Instinct<sup>™</sup> GPUs

### What is a rocAL Pipeline?

A pipeline is a data-flow graph whose nodes are operators
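The graph idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the simplest case of a linear chain of operators; `run_pipeline`, `decode`, and `resize` are hypothetical stand-ins and not rocAL functions. A real pipeline graph can also branch and merge.

```python
from typing import Callable, List

# Hypothetical type: an operator maps one batch to another batch.
Operator = Callable[[list], list]


def run_pipeline(batch: list, operators: List[Operator]) -> list:
    """Push a batch through each operator node in order."""
    for op in operators:
        batch = op(batch)
    return batch


def decode(batch: list) -> list:
    # Stand-in for image decoding: turn each bytes object into a list of ints.
    return [list(sample) for sample in batch]


def resize(batch: list, size: int = 2) -> list:
    # Stand-in for resizing: keep only the first `size` values of each sample.
    return [sample[:size] for sample in batch]


result = run_pipeline([b"\x01\x02\x03", b"\x04\x05\x06"], [decode, resize])
print(result)
```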




### rocAL Pipeline Dataflow




### **rocAL** Architecture


rocAL pipeline



#### rocAL Operators




#### rocAL Advantage


- One unified library that integrates with all the frameworks
- Optimized augmentation operations shared across all of them
- Flexible support for different data formats (file folder reading, LMDB, TFRecord, RecordIO), portable between frameworks








rocAL pipelines are used in three simple steps: Define, Build, and Run.

```
from amd.rocal.pipeline import pipeline_def
import amd.rocal.fn as fn

# Define: describe the graph of operators.
# file_dir, decoder_device, device, resize_w, and resize_h are set by the caller.
@pipeline_def
def example_pipeline():
    jpegs, labels = fn.readers.file(file_root=file_dir)
    images = fn.decoders.image(jpegs, device=decoder_device)
    resized_images = fn.resize(images, device=device, resize_x=resize_w, resize_y=resize_h)
    return resized_images, labels

# Build: instantiate the pipeline and build its graph.
pipe = example_pipeline(batch_size=8, num_threads=32, device_id=0)
pipe.build()

# Run: fetch one batch of processed images and labels.
images, labels = pipe.run()
```

# Sample Output From example\_pipeline

Each sample is decoded and resized to 224x224






#### **Flexible Pipelines: Scalability Across Devices**

- The input dataset is split into shards 0 through n
- Each shard feeds its own rocAL pipeline (rocAL Pipeline 0 through rocAL Pipeline n), and each pipeline has its own allocated CPU cores and GPU
- Each pipeline is configured with a GPU device\_id and CPU core bindings
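The sharding scheme on this slide can be illustrated with a small, framework-free sketch. It assumes contiguous, near-equal slices of the dataset per shard; the slide does not state rocAL's actual sharding policy, and `shard_indices` is a hypothetical helper, not a rocAL function.

```python
def shard_indices(num_samples: int, shard_id: int, num_shards: int) -> range:
    """Contiguous slice of sample indices for one shard (assumed policy)."""
    begin = num_samples * shard_id // num_shards
    end = num_samples * (shard_id + 1) // num_shards
    return range(begin, end)


num_samples, num_gpus = 10, 4

# One shard per pipeline; in a real setup each pipeline would also be given
# its own device_id and CPU core bindings.
shards = [list(shard_indices(num_samples, i, num_gpus)) for i in range(num_gpus)]

for device_id, shard in enumerate(shards):
    print(f"pipeline {device_id}: samples {shard}")
```

Every sample lands in exactly one shard, so the pipelines never process the same sample twice within an epoch.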

# rocAL's Impact In Performance (MLPerf ResNet-50 Training)




## rocAL's Core: AMD RPP Library






#### RPP performance compared to native processing for batch\_size=128 on MI200



### **Example: SSD Object Detection Training Augmentations**




#### Image with bboxes



# **MLPerf SSD Training With rocAL**

```
import amd.rocal.fn as fn
import amd.rocal.types as types
from amd.rocal.pipeline import Pipeline

# Define
# crop, flip_coin, contrast_rand, brightness_rand, hue_rand, and rali_device
# are not shown on the slide and are assumed to be defined elsewhere.
def COCOPipeline(batch_size, num_threads, local_rank, world_size, device_id, data_dir, ann_dir):
    pipe = Pipeline(batch_size, num_threads, device_id=device_id)
    with pipe:
        jpegs, bboxes, labels = fn.readers.coco(path=data_dir, random_shuffle=True)
        crop_begin, crop_size, bboxes, labels = fn.random_bbox_crop(
            bboxes, labels, device="cpu",
            aspect_ratio=[0.5, 2.0],
            thresholds=[0, 0.1, 0.3, 0.5, 0.7, 0.9])
        images_decoded = fn.decoders.image_slice(
            jpegs, crop_begin, crop_size, device="cpu", type=types.RGB)
        res_images = fn.resize(images_decoded, device="gpu", resize_x=crop, resize_y=crop)
        # The slide passes the three random parameters positionally after device=,
        # which is not valid Python; the keyword names used here are assumed.
        cl_twist_images = fn.color_twist(
            res_images, device="gpu",
            contrast=contrast_rand, brightness=brightness_rand, hue=hue_rand)
        bboxes = fn.bb_flip(bboxes, ltrb=True, horizontal=flip_coin)
        images = fn.crop_mirror_normalize(
            cl_twist_images, device="gpu", crop=(crop, crop), mirror=flip_coin,
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        bboxes, labels = fn.box_encoder(bboxes, labels, device=rali_device)
        pipe.set_outputs(images, bboxes, labels)
    return pipe

# Build
pipe = COCOPipeline(batch_size, num_threads, local_rank, world_size, device_id, data_dir, ann_dir)
pipe.build()
train_loader = rocALGenericIterator(pipe)

# Run
for i, data in enumerate(train_loader):
    images, bboxes, labels = data
    # do model training
```
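Two steps of this pipeline involve simple arithmetic that is easy to check by hand: the horizontal box flip that has to match the image mirror, and the per-channel normalization. The sketch below is plain Python, not rocAL code. It assumes boxes are in normalized `(left, top, right, bottom)` coordinates in the range 0 to 1, and both helper names are hypothetical; the mean and std values are the ones on the slide.

```python
# Per-channel mean and std in 0-255 pixel units, as given on the slide.
MEAN = [0.485 * 255, 0.456 * 255, 0.406 * 255]
STD = [0.229 * 255, 0.224 * 255, 0.225 * 255]


def flip_box_horizontal(box):
    """Mirror a normalized (left, top, right, bottom) box about the vertical axis."""
    left, top, right, bottom = box
    # The old right edge becomes the new left edge, and vice versa.
    return (1.0 - right, top, 1.0 - left, bottom)


def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


box = (0.1, 0.2, 0.4, 0.6)
flipped = flip_box_horizontal(box)
print(flipped)

# A pixel exactly at the dataset mean maps to zero in every channel.
print(normalize_pixel(MEAN))
```

Flipping the box with the same coin flip that mirrors the image keeps the annotation aligned with the object, which is why `flip_coin` is passed to both operators in the pipeline above.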


# rocAL Advantage in MLPerf SSD Training




### rocAL Use Case In Inference: Inference Server




### **Different Stages Of Inference Pipeline**






- New data types were introduced to support bounding-box and other metadata augmentations
- CPU-based decoding hurts performance, even with the TurboJPEG decoder
  - Hardware decoder using VCN
  - ROI-based decoding
#### Challenges

- Memory management is tricky with mixed devices and a variable batch\_size
- Image-processing transforms differ across frameworks: the same transform can produce different outputs
- Video processing needs a new data layout to represent sequences (NFHWC)
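To make the sequence layout concrete, the sketch below computes where one element sits in a flat, row-major NFHWC buffer. It assumes the letters stand for batch (N), frames per sequence (F), height (H), width (W), and channels (C); the slide names only the layout, so this reading and the helper `nfhwc_offset` are illustrative assumptions.

```python
def nfhwc_offset(index, shape):
    """Row-major offset of an (n, f, h, w, c) index in a tensor of the given shape."""
    offset = 0
    for i, dim in zip(index, shape):
        if not 0 <= i < dim:
            raise IndexError(f"index {i} out of range for dimension of size {dim}")
        offset = offset * dim + i
    return offset


# 2 sequences, 3 frames each, 4x5 pixels, 3 channels.
shape = (2, 3, 4, 5, 3)

frame_stride = nfhwc_offset((0, 1, 0, 0, 0), shape)   # elements in one frame
last_element = nfhwc_offset((1, 2, 3, 4, 2), shape)   # final element in the buffer
print(frame_stride, last_element)
```

Compared with the usual NHWC image layout, the extra F axis keeps all frames of one sequence contiguous, so a whole clip can be handed to an operator as a single slice.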

# Conclusion

- rocAL is AMD's open-source accelerated data augmentation and data loading library
- It provides full pre-processing pipelines for training or inference
- It integrates easily with the frameworks used for today's machine learning workloads
- rocAL's hybrid pipelines enable intelligent load balancing between CPU and GPU
- It is portable across multiple frameworks with one underlying library
- AMD's open-source RPP library provides the backbone for rocAL




#### References



#### rocAL

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX/tree/master/rocAL

#### **MIVisionX**

https://gpuopen-professionalcomputelibraries.github.io/MIVisionX/

#### RPP

https://github.com/GPUOpen-ProfessionalCompute-Libraries/MIVisionX

#### AMD ROCm

https://rocmdocs.amd.com/en/latest/

### Disclaimer

- The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
- THIS INFORMATION IS PROVIDED "AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
- © 2022 Advanced Micro Devices, Inc. All rights reserved.
- AMD, the AMD Arrow logo, EPYC, Radeon, MI100, MI200, rocAL, RPP, MIVisionX, ROCm and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

