## AIgean: An Open Framework for Machine Learning on a Heterogeneous Cluster

**Naif Tarafdar**<sup>1</sup>, Giuseppe Di Guglielmo<sup>2</sup>, Philip C Harris<sup>3</sup>, Jeffrey D Krupa<sup>3</sup>, Vladimir Loncar<sup>4</sup>, Dylan S Rankin<sup>3</sup>, Nhan Tran<sup>5</sup>, Zhenbin Wu<sup>6</sup>, Qianfeng Shen<sup>1</sup> and Paul Chow<sup>1</sup>

University of Toronto<sup>1</sup> Columbia University<sup>2</sup> Massachusetts Institute of Technology<sup>3</sup> CERN<sup>4</sup> Fermilab<sup>5</sup> University of Illinois<sup>6</sup>



## **Machine Learning**

- One of the most popular research topics
  - Across many areas and applications (e.g., medical, financial, safety, transportation)
  - Also within the computing community
- Widespread real-world use pushes the limits of devices
  - Metrics include performance and energy
  - Leading many researchers to consider heterogeneity!

February 18, 2020

CMC Accelerating AI Workshop



#### **Heterogeneity All Around Us**

Figure: Snapdragon 630 Mobile Platform block diagram, showing the Snapdragon X12 LTE modem, Adreno 508 GPU, display and video processing units, Hexagon DSP, Qualcomm Spectra 160 camera, Aqstic audio, CPU, mobile security, location, Wi-Fi, power management, and RF design blocks.


#### Applying Machine Learning to a Heterogeneous Environment

- Challenge: How do you design machine learning algorithms for a heterogeneous space?
  - Hard enough with a homogeneous computing environment
  - Is there a framework for such a thing?
- Challenge: If such a framework exists, can we get both flexibility and performance?




# Outline

- Brief motivation
- Overview of machine learning frameworks
  - Categorized as an abstraction layer stack
- Overview of Algean
  - HLS4ML
  - Galapagos
- Results




## **MACHINE LEARNING FRAMEWORKS**




## **Many Popular Examples!**

- Such as:
  - TensorFlow
  - PyTorch
  - Caffe
  - Intel DLA
  - Xilinx XfDNN
- What do these different frameworks offer?
  - Depends on who you ask!






Frameworks can be categorized as an abstraction layer stack:

- Applications & Algorithms
  - E.g., neural net layers, quantization, compression, pruning
- Cluster Deployment & Communication
  - E.g., physical connections (PCIe, Ethernet, etc.), communication protocols
- Hardware
  - E.g., hardware circuits (multipliers, shifters), memory architecture (caching, etc.)

- Allows researchers to pick and choose the layers they wish to configure
- Collapsible/expandable for a specific application and infrastructure!




## **AIGEAN OVERVIEW**


## **AIgean Introduction**

- Like the Aegean archipelago and sea
- Combines two existing frameworks:
  - HLS4ML
    - Generates HLS IP cores for ML
  - Galapagos
    - Connects and deploys heterogeneous distributed applications across multiple nodes




## HLS4ML

- Open source project
- Input:
  - Description of FPGA resources
    - LUT, BRAM, DSP
  - Description of the neural net
    - PyTorch support
- Output:
  - HLS-synthesizable C++ that implements the neural net within the resource constraints
- Tunable HLS code, made to fit the FPGA
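As a rough illustration of what "made to fit the FPGA" can mean, the sketch below assumes a simplified cost model: each multiplication in a dense layer needs one DSP slice unless multipliers are time-shared by a reuse factor. The function names, layer sizes, and DSP budget are hypothetical, and this is not HLS4ML's actual resource estimator.

```python
from math import ceil


def dsp_estimate(layer_shapes, reuse_factor):
    """Estimate DSP slices when each multiplier is shared `reuse_factor` times.

    Simplifying assumption: one DSP slice per physical multiplier.
    """
    return sum(ceil(n_in * n_out / reuse_factor) for n_in, n_out in layer_shapes)


def smallest_fitting_reuse_factor(layer_shapes, dsp_budget, max_reuse=1024):
    """Return the smallest reuse factor whose estimate fits the budget, or None."""
    for reuse in range(1, max_reuse + 1):
        if dsp_estimate(layer_shapes, reuse) <= dsp_budget:
            return reuse
    return None


# Hypothetical dense layers as (inputs, outputs) pairs.
layers = [(16, 64), (64, 32), (32, 5)]
dsp_budget = 1968  # assumed DSP budget for the target device

fully_parallel = dsp_estimate(layers, reuse_factor=1)
chosen_reuse = smallest_fitting_reuse_factor(layers, dsp_budget)
print(fully_parallel, chosen_reuse, dsp_estimate(layers, chosen_reuse))
```

A higher reuse factor lowers the multiplier count at the cost of more cycles per inference, which is the trade-off a tunable design exposes.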














## Galapagos

- Heterogeneous stack
- Allows users to create flexible heterogeneous clusters across CPUs/FPGAs
- Seamlessly prototype by implementing on both CPU and FPGA
  - Galapagos ensures functional portability for network communication
  - Essentially "network-connected" HLS kernels
    - For both SW and HW
  - Iterative development: selectively move bottlenecks from software to hardware without modifying code
- Flexibly change the communication protocol without modifying the user application
  - TCP, UDP, L1, etc.
  - User application is agnostic to this
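The protocol-agnostic idea can be sketched in plain Python: the kernel below only calls `send`, so the transport underneath can be swapped without touching kernel code. Every class and function name here is hypothetical and does not reflect the Galapagos API; the two transports merely stand in for different communication protocols.

```python
import struct
from collections import deque


class QueueTransport:
    """Passes (dest, value) messages through an in-memory FIFO."""

    def __init__(self):
        self._fifo = deque()

    def send(self, dest, value):
        self._fifo.append((dest, value))

    def recv(self):
        return self._fifo.popleft()


class BytesTransport:
    """Serializes each message to bytes, as a network protocol would."""

    def __init__(self):
        self._wire = deque()

    def send(self, dest, value):
        # Network byte order: 16-bit destination id, 32-bit signed payload.
        self._wire.append(struct.pack("!Hi", dest, value))

    def recv(self):
        return struct.unpack("!Hi", self._wire.popleft())


def doubling_kernel(transport, dest, values):
    """User kernel: knows only `send`, not which protocol carries the data."""
    for value in values:
        transport.send(dest, 2 * value)


def run(transport):
    doubling_kernel(transport, dest=1, values=[1, 2, 3])
    return [transport.recv() for _ in range(3)]


print(run(QueueTransport()))
print(run(BytesTransport()))
```

Both runs yield the same messages, which is the property the user application relies on when the protocol changes.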


Galapagos stack layers, from top to bottom:

- Communication Layer
- Middleware/Network Layer
- Hypervisor Layer
- Physical Hardware



## **Birth of AIgean**

- HLS4ML creates HLS IP cores that maximize FPGA utilization
- Galapagos provides a multi-FPGA fabric
- The tools are combined to deploy a neural net on the multi-FPGA fabric












## **Galapagos/HLS4ML Modifications**

- Galapagos uses bridging to connect HLS kernels together
  - By default uses the AXI-Stream protocol
  - When a kernel sends a packet to an off-chip kernel, the AXI-Stream packet is transformed into a network packet via a Galapagos bridge
- Galapagos modified to automate the formation of user-specified bridges
  - Converts the user protocol (HLS4ML) into Galapagos AXI-Stream
- HLS4ML modified to stream packets between IP cores
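To make the bridging step more concrete, here is a minimal software model of what a bridge from a user protocol to AXI-Stream might do: wrap each value from a kernel in a flit that carries a destination and an end-of-packet marker. `TDATA`, `TDEST`, and `TLAST` are standard AXI-Stream signal names; `Flit`, `bridge_to_axis`, and the fixed packet length are assumptions for illustration and are not the actual Galapagos bridge interface.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Flit:
    tdata: int   # payload word
    tdest: int   # id of the destination kernel
    tlast: bool  # True on the final flit of a packet


def bridge_to_axis(values: Iterable[int], dest: int, flits_per_packet: int) -> List[Flit]:
    """Wrap a raw value stream into flits, closing a packet every N flits."""
    if flits_per_packet <= 0:
        raise ValueError("flits_per_packet must be positive")
    words = list(values)
    flits = []
    for index, word in enumerate(words):
        packet_full = (index + 1) % flits_per_packet == 0
        stream_done = index == len(words) - 1
        flits.append(Flit(tdata=word, tdest=dest, tlast=packet_full or stream_done))
    return flits


flits = bridge_to_axis(range(5), dest=3, flits_per_packet=2)
for flit in flits:
    print(flit)
```

The end-of-packet marker is what would let a downstream bridge know where one network packet stops and the next begins.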




## RESULTS


## **Experiment Setup**

- CPUs
  - Xeon E5-2650
    - 24 Cores at 2.2 GHz
- FPGAs
  - Fidus Sidewinder
    - ZU19EG FPGA
      - ~1 million logic cells, 35 MB BRAM, 1968 DSP slices
    - 100 GB network interface
  - 100 GB UDP core






- Latency: time to send a single flit
- Throughput: maximum throughput of the link (varying packet size for software)

| Link                 | Latency    | Throughput |
|----------------------|------------|------------|
| Software to Hardware | 0.029 ms   | 0.244 GB/s |
| Hardware to Hardware | 0.00017 ms | 100 GB/s   |
| Hardware to Software | 0.0203 ms  | N/A        |




- Software to hardware: the larger the packet, the higher the throughput
- UDP packet size is limited
  - No segmentation
  - Bounded by the MTU size
  - Jumbo frames: 8K
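With no segmentation available, the sender has to keep each UDP datagram within the frame limit itself. The sketch below shows one way to do that, assuming "8K" means 8192 bytes of payload and ignoring header overhead; the function name and the 20000-byte message are hypothetical.

```python
def split_into_datagrams(payload: bytes, max_payload: int) -> list:
    """Cut a payload into chunks that each fit in a single datagram."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [payload[start:start + max_payload]
            for start in range(0, len(payload), max_payload)]


JUMBO_PAYLOAD = 8192  # assumed payload limit for an 8K jumbo frame

message = bytes(20000)  # hypothetical 20000-byte input buffer
datagrams = split_into_datagrams(message, JUMBO_PAYLOAD)
print([len(chunk) for chunk in datagrams])
```

Fewer, larger datagrams mean less per-packet work in software, which fits the observation that throughput rises with packet size.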




- Hardware to hardware: line rate, with the same throughput at small and large packet sizes






- Hardware to software: hardware sends UDP at line rate, software can't keep up, and we see packet drops






## **Small Neural Network: Results**

- Single CPU, single FPGA, used in a physics application to calculate the energy of a particle
- 16K inferences
- SDAccel (without AIgean): 3 ms
- AIgean: 6.3 ms
  - Latency of a single inference: 0.08 ms; possible because the data is streamed, which SDAccel does not allow
- Bottleneck: sending data to the FPGA via the CPU network link
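The batch totals and the single-inference latency measure different things, which a little arithmetic makes visible. The sketch below assumes "16K" means 16384 inferences; the variable names are illustrative and the timings are the ones listed above.

```python
BATCH_SIZE = 16 * 1024          # assumption: "16K" read as 16384 inferences
SDACCEL_BATCH_MS = 3.0          # total batch time reported for SDAccel
AIGEAN_BATCH_MS = 6.3           # total batch time reported for AIgean
AIGEAN_SINGLE_LATENCY_MS = 0.08 # reported latency of one streamed inference

# Amortized time per inference when the whole batch is processed.
sdaccel_us_per_inference = SDACCEL_BATCH_MS * 1000 / BATCH_SIZE
aigean_us_per_inference = AIGEAN_BATCH_MS * 1000 / BATCH_SIZE

# How much longer the AIgean batch takes than the SDAccel batch.
batch_ratio = AIGEAN_BATCH_MS / SDACCEL_BATCH_MS

# How soon the first streamed result arrives, relative to a full SDAccel batch.
first_result_advantage = SDACCEL_BATCH_MS / AIGEAN_SINGLE_LATENCY_MS

print(round(sdaccel_us_per_inference, 3), round(aigean_us_per_inference, 3))
print(round(batch_ratio, 2), round(first_result_advantage, 1))
```

Under these assumptions the AIgean batch takes about twice as long, while a single streamed result arrives well before a full SDAccel batch completes.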




## **Small Neural Network: Takeaway**

- Comparison with SDAccel shows that a network link to a single FPGA can be competitive with PCIe
  - The network link wins on scalability: many more FPGAs are reachable via the network than via PCIe
- Can stream data
  - Single-inference latency is much lower
- Should target larger applications
  - We can do this as we have a large multi-FPGA fabric!




## **Autoencoder: Results**

- Autoencoder implemented both in SDAccel on a single FPGA and in AIgean using three FPGAs
- SDAccel: single FPGA, higher reuse factor to fit the logic
  - 0.26 ms
- AIgean: three FPGAs
  - 0.08 ms, more than 3x improvement




#### **Autoencoder: Takeaway**

- Using a larger fabric allows us to implement larger circuits
- The difficulty of communicating between multiple FPGAs is abstracted away


## **ResNet-50**

- Large fabric allows us to target ResNet-50
- Individual layers per FPGA
- Work in progress
- Estimated throughput: 5 ms (longest layer) per image
  - Functionality first, no internal optimization yet
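Placing layers on separate FPGAs forms a pipeline, so the slowest stage sets the rate at which images complete. The sketch below illustrates that reasoning under an idealized model that ignores network transfer time; the per-stage times are hypothetical, except that the slowest is set to the 5 ms estimate.

```python
def pipeline_stats(stage_times_ms):
    """Idealized pipeline model: one stage per device, no transfer overhead."""
    bottleneck_ms = max(stage_times_ms)
    return {
        "interval_ms": bottleneck_ms,             # time between finished images
        "images_per_second": 1000.0 / bottleneck_ms,
        "latency_ms": sum(stage_times_ms),        # time for one image end to end
    }


# Hypothetical per-stage times; the slowest stage takes 5 ms.
stats = pipeline_stats([1.0, 5.0, 2.5])
print(stats)
```

In this model, shortening any stage other than the slowest lowers latency but leaves throughput unchanged, which is why the longest layer is the number quoted.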




## **SUMMARY AND CONCLUSION**




#### **Summary**

- A multi-FPGA/CPU neural net framework built by combining the HLS4ML and Galapagos frameworks
- Tunable IP cores, flexible communication
- ML HLS IP cores deployed onto a cluster of network-connected FPGAs and CPUs
- Communication abstracted away from the user




## Conclusion

- Network-connected FPGAs/CPUs are more scalable than traditional PCIe
- Creating larger fabrics with network-connected FPGAs opens the door to more complex algorithms
- Many opportunities to explore in multi-FPGA ML


