



### Accelerating AI Workshop 2023 – Challenges and Opportunities in Cloud and Edge Computing

### **Polara: a RISC-V multicore vector processor**

**Nizar El Zarif**, Mohammad Hossein Askari Hemmat, Theo Dupuis, Yoan Fournier, Elisabeth Humblet, Francois Leduc-Primeau, Jean-Pierre David, Yvon Savaria



May 4, 2023





### Outline



- Introduction
  - What are Ariane, Ara, and Polara
  - Why vector instructions are important for AI
- Research objectives
- Sub-byte computation methods
- Polara implementation
- Building a custom AI framework
- Conclusion





### **CVA6 processor (Ariane)**



- Ariane is the codename for the RISC-V processor that uses the RV64 instruction set.
- CVA6 is a 6-stage, in-order, single-issue processor with support for the M, A, and C extensions.
- https://github.com/openhwgroup/cva6









### Ara



- Ara (CV-VEC) is the codename for the vector processor.
- Vector and array processors both use the SIMD programming model to improve computation throughput.
- Examples of array processors are x86 cores with AVX instructions and ARM cores with NEON.
- Vectors in an array processor are fixed in size, whereas a vector processor can use a dynamic vector length, here up to 4096-bit operations.
- Ara can be configured with different numbers of lanes; a wider design allows more parallelism.
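The difference between fixed-size SIMD and a dynamic vector length can be illustrated with a strip-mined loop. The sketch below is a plain-Python model, not Ara code: `set_vector_length` is a hypothetical stand-in for the RVV `vsetvli` instruction (simplified here to the minimum of the remaining element count and the hardware maximum), and `vlmax` is an assumed parameter rather than a value taken from Ara.

```python
def set_vector_length(remaining, vlmax):
    """Hypothetical model of vsetvli: grant at most vlmax elements per iteration."""
    return min(remaining, vlmax)


def vector_add(a, b, vlmax):
    """Add two equal-length lists in chunks, as a vector-length-agnostic loop would."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        vl = set_vector_length(n - i, vlmax)  # last chunk may be shorter
        out[i:i + vl] = [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
    return out
```

The same loop works unchanged for any `vlmax`, which is the property that lets one binary run on designs with different vector widths; a fixed-size SIMD loop would instead need a separate scalar tail for the leftover elements.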









### What is Polara?

**CORE-V** Polara





- Polara aims to create a multicore version of the Ara vector processor using the OpenPiton framework
- The ASIC will possess 4 instances of Ara (CV-VEC) with 4 lanes each









- SIMD allows multiple operations with a single instruction.
- This type of parallelism is called data-level parallelism (DLP).
- DLP allows for better hardware utilization.
- Vector processors and array processors are two different implementations of the SIMD computation model.









- AI workloads can be divided into training and inference.
- Deep learning models usually rely on BLAS (Basic Linear Algebra Subprograms) libraries.
- BLAS is a set of programming routines commonly used for linear algebra operations.
- Neural networks commonly use AXPY, a level-1 BLAS routine:

$$y = \alpha x + y$$

where x and y are vectors and $\alpha$ is a scalar

- MATMUL is a level-3 BLAS routine, defined as

$$C \coloneqq AB + C$$

where C, A, and B are matrices
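The two routines can be written out in a few lines. This is a minimal pure-Python sketch for illustration only; the function names `axpy` and `matmul_accumulate` are chosen here and are not a BLAS API.

```python
def axpy(alpha, x, y):
    """Return alpha * x + y for two equal-length vectors."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def matmul_accumulate(a, b, c):
    """Return A @ B + C for row-major nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]
```

In both routines the inner work is the same multiply-accumulate applied across many independent elements, which is why these kernels map well onto vector instructions.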







1. Extend Ara to support low-precision instructions and sub-byte computations. This would help with low-precision quantized neural networks and binary neural networks.
2. Build an FPGA model of the chip for verification.
3. Build a custom chip from the Ara + Ariane + extension design in GLOBALFOUNDRIES 22FDX 22 nm technology, with the help of CMC.
4. Build a software stack and runtime to run neural networks on the fabricated chip.





### Sub-byte computation methods



Software/hardware techniques that compute sub-byte dot products more efficiently







#### **BIT-SERIAL** [1]

- Computation is done between corresponding bits of the operands
- Bit-serial complexity is $O(N \times M)$
- Suitable for very small precision (typically 1-bit to 2-bit)
- Requires modification of the hardware

#### **ULPPACK** [2]

- Vectorizes the computation using scalar instructions in a serial manner
- More versatile (up to 4-bit precision)
- Efficient in software at small precision (no hardware added)

Figure: example from [2] comparing both methods. Performance (GOPS) of GEMMLOWP, QNNPACK, bit-serial, and ULPPACK for W1A1 to W5A5. Bit-serial performance decreases rapidly, whereas ULPPACK yields great performance from W2A2 to W4A4.




[1] M. Cowan, T. Moreau, T. Chen, J. Bornholt, and L. Ceze, 'Automatic generation of high-performance quantized machine learning kernels', Feb. 2020, pp. 305–316.

[2] J. Won, J. Si, S. Son, T. J. Ham, and J. W. Lee, 'ULPPACK: Fast Sub-8-bit Matrix Multiply on Commodity SIMD Hardware', in *Proceedings of Machine Learning and Systems*, 2022, vol. 4, pp. 52–63.
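The two approaches can be modelled in plain Python to show where their costs come from. This is a sketch in the spirit of [1] and [2], not the authors' kernels: it assumes unsigned operands of at most 3 bits, an even number of elements, and a hypothetical 16-bit lane split into two 8-bit fields (`SHIFT` and `LANE_MASK` are illustrative choices).

```python
SHIFT = 8           # assumed field width inside one 16-bit lane
LANE_MASK = 0xFFFF  # models truncation of the product to a 16-bit lane


def bit_planes(values, bits):
    """One integer per bit position: bit k of plane b is bit b of values[k]."""
    planes = []
    for b in range(bits):
        plane = 0
        for k, v in enumerate(values):
            plane |= ((v >> b) & 1) << k
        planes.append(plane)
    return planes


def bitserial_dot(weights, activations, w_bits, a_bits):
    """Unsigned dot product from AND + popcount over every pair of bit planes."""
    total = 0
    for i, w_plane in enumerate(bit_planes(weights, w_bits)):
        for j, a_plane in enumerate(bit_planes(activations, a_bits)):
            total += bin(w_plane & a_plane).count("1") << (i + j)
    return total


def pack_pairs(values, swap):
    """Pack two sub-byte values per lane; swap reverses their order in the lane."""
    packed = []
    for k in range(0, len(values), 2):
        low, high = values[k], values[k + 1]
        if swap:
            low, high = high, low
        packed.append(low | (high << SHIFT))
    return packed


def packed_dot(weights, activations):
    """One multiply per lane leaves w0*a0 + w1*a1 in the upper field of the lane."""
    total = 0
    for w, a in zip(pack_pairs(weights, False), pack_pairs(activations, True)):
        product = (w * a) & LANE_MASK
        total += product >> SHIFT
    return total
```

The nested loop in `bitserial_dot` runs once per pair of bit planes, which matches the $O(N \times M)$ cost quoted above if N and M are read as the operand bit widths (an interpretation assumed here). `packed_dot` needs only one ordinary multiply per pair of elements, but the fields must be wide enough that partial sums never carry into a neighbouring field, which is what limits the usable precision.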





#### RISC-V VECTOR PROCESSOR: PARALLELIZATION OF COMPUTATIONS





### 4-lane Ara micro-architecture





[3] T. Dupuis, Y. Fournier, M. AskariHemmat, N. E. Zarif, F. Leduc-Primeau, J. P. David, and Y. Savaria, "Sparq: A custom RISC-V vector processor for efficient sub-byte quantized inference," 2023, Unpublished.

Performance of 2… (… 128 input). Baseline: Ara FP32 at **12.33** operations/cycle.

| Precision | Ara, ULPPACK (software only) | Bit-serial (HW acc.) | Sparq, ULPPACK (HW acc.) | Sparq v2, ULPPACK (HW acc.) |
|-----------|------------------------------|----------------------|--------------------------|-----------------------------|
| W1A1      | 43.32 (×3.51)                | 51.67 (×4.19)        | 48.84 (×3.96)            | Same performance as Sparq (…) |
| W2A2      | 34.40 (×2.78)                | 30.80 (×2.50)        | 37.96 (×3.08)            |                             |
| W3A3      | 27.57 (×2.23)                | N/A                  | 37.96 (×3.08)            |                             |

Results are in operations/cycle; the factors in parentheses are relative to the Ara FP32 baseline. Simulated using RT…



### CORE-V Polara Architecture

- Polara aims to create a multicore version of the Ara vector processor using the **OpenPiton** framework
- The ASIC will possess 4 instances of Ara (CV-VEC) with 4 lanes each



### OpenPiton Integration



- With help from UC Santa Barbara, a Polara tile is integrated into OpenPiton using an AXI-to-NoC bridge.
- For the first stage of integration, initial **RISC-V vector tests** are run in a single-tile configuration.
- The next step in our plan is to run these tests in a multi-tile configuration.






### FPGA Emulation - Ara

Synthesis and implementation completed on Xilinx Alveo U280

Configuration:

- 4 lanes
- 512kB L2 cache
- Max achievable frequency: 75 MHz

Bitstream generation: Adaptation of the board's constraints in the bitstream










#### IO planning on Xilinx Alveo U280



Adding the constraint file with the IOs improved the worst negative slack (WNS) from +0.036 ns to +0.238 ns at 75 MHz
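A positive slack can be turned into a rough frequency headroom figure. The sketch below uses a common first-order estimate, in which the shortest usable clock period is the target period minus the setup slack. It ignores everything a real timing sign-off would consider, and the function names are illustrative.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def estimated_fmax_mhz(target_mhz, wns_ns):
    """First-order estimate: the critical path takes (period - slack)."""
    return 1000.0 / (period_ns(target_mhz) - wns_ns)


before = estimated_fmax_mhz(75.0, 0.036)  # slack before adding the IO constraints
after = estimated_fmax_mhz(75.0, 0.238)   # slack after adding the IO constraints
print(f"period at 75 MHz: {period_ns(75.0):.3f} ns")
print(f"estimated fmax before: {before:.2f} MHz, after: {after:.2f} MHz")
```

With a period of about 13.3 ns, both slack values are small, so under this estimate the design stays close to the 75 MHz target in either case.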



### **FPGA Emulation**



Next steps:

- Ara
  - Program the FPGA board with Ara
  - Run basic tests on Ara
- OpenPiton
  - Write a FuseSoC script for Polara emulation: synthesis, implementation, and bitstream generation
  - Program the FPGA board with 1 tile of Polara's OpenPiton
  - Run tests on Polara





### ASIC design - Specifications

Table 1: CORE-V Polara specifications

| Name                          | CORE-V Polara                |
|-------------------------------|------------------------------|
| Technology                    | GLOBALFOUNDRIES 22FDX FD-SOI |
| Package                       | CPGA208                      |
| Target frequency <sup>1</sup> | ≥ 750 MHz                    |
| Power <sup>2</sup>            | ≤ 1.25 W                     |
| Width                         | 3 mm                         |
| Height                        | 3 mm                         |
| Area                          | 9 mm²                        |

| I/O type                     | Inputs | Outputs | Total |
|------------------------------|--------|---------|-------|
| **Power**                    | 112    | 0       | 112   |
| VDDC                         | 28     | 0       | 28    |
| VDDIO                        | 28     | 0       | 28    |
| VSSC                         | 28     | 0       | 28    |
| VSSIO                        | 28     | 0       | 28    |
| **Off-Chip Interface (OCI)** | 46     | 39      | 85    |
| Reset                        | 1      | 0       | 1     |
| FLL & Clock                  | 4      | 1       | 5     |
| JTAG                         | 4      | 1       | 5     |
| Chip bridge (data)           | 37     | 37      | 74    |
| **Total**                    | 158    | 39      | 197   |



- CORE-V Polara is targeted for tape-out this summer in GF22FDX
- We are aiming for a 3 × 3 mm die with a clock of 750 MHz or more



#### ASIC design – Work left to do



- The Lane and Ara macros are tape-out ready
- The OpenPiton Tile macro is being finalized
- Work on the Top-Level macro is starting





### ASIC design - Macros











- Work on the Top Level is starting
- Total of **4** Tiles with the chip bridge and the OpenPiton peripherals
- FLL integration is still to be done





### Timing results

**Setup mode**

|                 | Lane   | …       | …      |
|-----------------|--------|---------|--------|
| WNS (ns)        | -0.031 | -0.080  | -0.024 |
| TNS (ns)        | -0.943 | -58.359 | -4.486 |
| Violating paths | 187    | 2905    | …      |
| All paths       | …      | …       | …      |

**Hold mode**

|                 | Lane   | …      | …      |
|-----------------|--------|--------|--------|
| WNS (ns)        | -0.022 | -0.101 | -0.000 |
| TNS (ns)        | …      | …      | -0.000 |
| Violating paths | 14     | …      | 7      |
| All paths       | …      | …      | 31267  |

- Timing results for the current macros
  - Clock @ 750 MHz
  - Some small violations left to fix
  - Violations can be **fixed** using ECO on the critical paths







**Tape-out plan** 



#### MPW2254:

- Application deadline: 2023-04-10
- Cancellation date: 2023-07-05
- Export control date: 2023-07-19
- Design submission deadline: 2023-07-19
- Delivery date: 2024-01-05







### AI runtime



- Usually, running an AI workload on RISC-V requires a full OS (such as Linux) built with support for user-level vector instructions.
- The alternative is to run the AI workload directly on bare metal, which presents its own challenges.
- While CVA6 does have an MMU (memory management unit), the Ara vector processor does not.
- An MMU is necessary for address translation and virtual memory (and for running Linux).
- Thus, running RISC-V vector code in an OS environment is not possible.







### **AI runtime**



- Solution:
  - Build an AI runtime that works directly on bare metal
  - Bare-metal AI not only works without an OS, it also avoids memory address translation, which can result in fewer memory operations and so reduce computation time and power
  - The generic code structure that we build should support a wide array of microcontrollers and processors
- Challenges:
  - Only the basic functionality of C is supported: there is no support for the standard C libraries, or even for facilities such as malloc and other dynamic memory allocation.
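One common way to live without `malloc` is to reserve a single fixed buffer up front and hand out slices of it in order. The sketch below models that idea in Python for brevity; on the target it would be C operating on a static array. The class `StaticArena`, its buffer size, and its alignment are illustrative assumptions, not part of the Polara runtime.

```python
class StaticArena:
    """Bump allocator over one fixed buffer; nothing is ever freed individually."""

    def __init__(self, size, align=8):
        self.buffer = bytearray(size)  # stands in for a statically reserved array
        self.align = align
        self.offset = 0

    def alloc(self, nbytes):
        """Return a view of nbytes aligned bytes, or fail if the arena is full."""
        start = (self.offset + self.align - 1) // self.align * self.align
        if start + nbytes > len(self.buffer):
            raise MemoryError("arena exhausted")
        self.offset = start + nbytes
        return memoryview(self.buffer)[start:start + nbytes]

    def reset(self):
        """Release everything at once, for example between two inferences."""
        self.offset = 0
```

Because a network's tensor sizes are known when the code is generated, the total buffer size can be fixed ahead of time, and a failed allocation becomes a build-time sizing problem rather than a runtime surprise.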







### Keras2c

- Keras2c has two parts:
  - **Code gen**: takes a Keras ".h5" network as input and converts it to C code. Written in Python
  - **Library files**: written in C; they contain the supported ML ops and the data structures needed to run the generated C code
- Capable of running in a bare-metal environment with little modification
- https://github.com/f0uriest/keras2c


















### Keras2c on RISC-V



- The generated binary runs in Verilator, a cycle-accurate simulator
- It produces the expected output, identical to the output obtained when running natively on x86 Linux or as a RISC-V ELF on Spike







### Polara-keras2c



- Rewrote some Keras2c functions with custom implementations that work on bare metal (memcpy, memset, strlen, strnlen, strcmp, strcpy, printf, ...)
- Updated code generation to remove calls incompatible with the Polara bare-metal environment
- Added safe fixed-point implementations of some operations (conv2d, relu, add, dense, padding)
- The remaining layers are still implemented in floating point
- The library files can be replaced with an optimized implementation (for example RISC-V vector code)
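A fixed-point layer of the kind listed above can be sketched as follows. This is a Python model under stated assumptions, not the Polara-keras2c code: it assumes a 16-bit format with 8 fractional bits (`FRAC_BITS`), reads "safe" as saturating on overflow instead of wrapping, and all function names are illustrative.

```python
FRAC_BITS = 8                       # assumed number of fractional bits
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(value):
    """Clamp to the 16-bit range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, value))


def to_fixed(x):
    """Convert a float to the assumed fixed-point format."""
    return saturate(round(x * (1 << FRAC_BITS)))


def fixed_dense(inputs, weights, bias):
    """Dense layer: wide accumulation, one rescale and saturation per output."""
    outputs = []
    for row, b in zip(weights, bias):
        acc = b << FRAC_BITS        # align the bias with the products
        for x, w in zip(inputs, row):
            acc += x * w            # each product carries 2 * FRAC_BITS fractional bits
        outputs.append(saturate(acc >> FRAC_BITS))
    return outputs


def fixed_relu(values):
    """ReLU is format-independent: negative values become zero."""
    return [max(0, v) for v in values]
```

For example, `fixed_relu(fixed_dense(x, w, b))` with inputs produced by `to_fixed` gives results that can be converted back by dividing by `1 << FRAC_BITS`. Accumulating in a wider type and rescaling once at the end keeps rounding error to a single step per output.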





### Preliminary results of inference

- 97% of cycles are spent executing conv2d
- Optimization effort therefore targets conv2d
- Implemented RVV int8 vector kernels to speed up convolution, maxpool, and ReLU
- In simulation, Ara comes close to the IPC of a 12<sup>th</sup>-generation Intel Core using AVX instructions, at much lower energy use
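The 97% figure also bounds what optimizing conv2d alone can achieve. The sketch below applies Amdahl's law to it; the kernel speedup values in the loop are hypothetical inputs chosen for illustration, not measured results.

```python
def amdahl_speedup(fraction, kernel_speedup):
    """Overall speedup when only `fraction` of the run time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


CONV2D_FRACTION = 0.97  # share of cycles spent in conv2d, from the profile above

for s in (2, 4, 8, 16):  # hypothetical conv2d speedups
    print(f"conv2d x{s}: overall x{amdahl_speedup(CONV2D_FRACTION, s):.2f}")

# Upper bound if conv2d took no time at all
print(f"limit: x{1.0 / (1.0 - CONV2D_FRACTION):.1f}")
```

The remaining 3% of cycles caps the overall gain at roughly 33 times, so once conv2d is vectorized the other layers (maxpool, ReLU) start to matter.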









### **Future work**



- Finish the RTL for Sparq v2
- FPGA implementation of Polara
- Validation testing of Polara
- Fix minor timing issues in Polara
- Full-stack implementation of polara-keras2c with proper scaling and shifting factors for low-precision neural networks







## Conclusion



- Work on multicore Ara is progressing well
- The chip is expected back in 2024
- High performance and power efficiency using RISC-V vector instructions









# Thank you!