

#### Webinar III

## HPP-Heterogeneous Processing Platform Computer Vision Using OpenCV/OpenCL Targeting the HPP

Yassine Hariri Senior Engineer, Platform Design Hariri@cmc.ca Hugh W. Pollitt-Smith Senior System Design Engineer Pollitt-smith@cmc.ca

### **Agenda**



- Overview
- emSYSCAN Development Systems Update
- HPP: Heterogeneous processing platform
- Computer Vision challenges and opportunities
- OpenCV
- HPP GPU
  - OpenCV and CUDA
  - Live demo: OpenCV/CUDA targeting the HPP
- HPP FPGA
  - OpenCL
  - Live demo: OpenCL/OpenGL targeting the HPP
- Q&A



## EMSYSCAN DEVELOPMENT SYSTEMS UPDATE

# 

Microsystems
Rapid-Prototyping,
Characterization
and Integration Labs

- 4 Universities:
- UBC
- U Manitoba
- Queen's
- École Polytechnique





#### Multi-Technology Design Environment

- System architecture exploration
- Multi-technology simulation
- Design of custom devices for manufacturing

#### **Development Systems**

 System validation and proof-of-concept demonstration



Real-Time Embedded Software Lab

University of Waterloo

 Design, analysis, debug of real-time software on nextgeneration processor systems



Common, Shared Platforms Interconnected Community of Users Knowledge Repository Centralized Management & Operations



License Management Server (LMS)

## Development Systems for Proof of Concept













Images courtesy of National Instruments, Xilinx, BEECube,, NVIDIA

## National Research Platform: Enriched Projects; Results Sooner



- Common set of programmable research platforms with proof of concept features
- Pooled equipment timeshared among users
- Sharing of knowledge on equipment usage
- Adaptive over time in terms of equipment quantities and equipment features
- Large community of users, institutions
- <u>Leveraged</u> industrial partners (e.g., STMicro.)

National project scope and sizeable outcomes enabled by centralized project implementation and management by CMC Microsystems

## Installation and usage



- Shared access systems can be accessed at no charge but require Designer level subscription
  - Subscription provides access to support, tools, reference designs, forums, workshops, travel, select/swap, training, additional discounts
- Systems delivered on site, remote access
- Designated Development System coordinator(s) at each site
  - Communicate institutional needs for purchase specifications
  - Local advocate, information source
  - Encourage participation in National Project

## emSYSCAN Development Systems delivered (Gen1)



- Embedded Systems Platform:
  - Xilinx ML605, Altera DE4-530
- Advanced Processing Platform
  - BEEcube BEE3, BEE4, miniBEE
- Software-Defined Radio Platform
  - BEEcube miniBEE, RF daughtercard
- Simulation Acceleration Platform
  - Nallatech P385-D5 (Altera Stratix V, OpenCL)
- Multiprocessor Array Platform
  - NVIDIA Tesla K20 GPU
  - Intel Xeon Phi
- Microsystems Integration Platform
  - National Instruments PXI-based, FPGA, MEMS, microfluidics, RF, photonics features

## NDN Development Systems Community



https://community.cmc.ca/community/development-systems



## Upcoming emSYSCAN Development Systems deployments

- Embedded Systems Platform
  - Xilinx Virtex-7, Ultrascale, Zynq options
  - Altera Arria 10 and Arria 10 SoC options
  - Shipping Q1 2016
- Advanced Processing Platform
  - RFP currently in evaluation
  - Shipping Q1/Q2 2016
- Software-defined Radio
  - BEEcube nano/megaBEE (2x2 up to 16x16 MIMO options)
  - Shipping Q1 2016

## emSYSCAN Wireless Sensor Network Kits



- For researchers interested in topics related to the Internet of Things, healthcare, smart sensor systems & algorithms & wireless communications.
- Zigbee and Bluetooth based components including:
  - IAR Embedded Workbench for ARM Cortex M
  - Mpression Odyssey MAX 10 FPGA with Bluetooth
  - Freescale Freedom, Beaglebone, Raspberry Pi,
  - CC2538DK, CC2650STK, programmers
  - 10-port USB Industrial Charger (power source)

#### Timeline:

- IAR Embedded Workbench for ARM Cortex-M, dev kits,
   & programmers delivered Jan 7, 2016 to 14 universities
- Zigbee connectivity documentation, end of Feb 2016.

#### Details at:

www.cmc.ca | Products & Services |
 Development Systems | Sensor Platforms | Wireless Sensor Network Kits





#### Canada's National Design Network – ADEPT Management & Operations

Includes software procurement, configuration, installation and delivery. Access and utilization management, engineering/technical support. Cybersecurity installations, secure testbed assistance and demonstrations. trainer events. Advisory Group coordination. Governance, reporting, legal and financial administration.

Train-the-

### **HPP Distribution**



- Based on Development Systems Coordinator consultations in April 2014:
  - Generation 1 (2014/15): 18 systems
    - USask, UQTR, Outaouais, McGill, York, Windsor, Waterloo, Western, Ottawa, Ryerson, RMC, Victoria
  - Generation 2 (2016/17): 12 systems
    - Memorial, Guelph, McMaster, Toronto, Polytechnique, UQTR, Outaouais



# HPP: HETEROGENEOUS PROCESSING PLATFORM

## **HPP Project status**



**Status** (12 universities selected 18 systems G1, 7 universities selected 8 G2)

o Assembled, cloned, tested and shipped 18/18 units

#### **Tutorials and reference designs**

- o Quick start guide : Heterogeneous Parallel Platform (HPP) Available online
- o User Guide: Performance and Power profiling for the HPP Release Date: early Feb. 2016
- o Computer vision using OpenCV/OpenCL targeting the HPP- Heterogeneous Processing Platform Release Date: early Feb. 2016

#### Webinars series for the HPP

- o Introduction to the HPP-Heterogeneous Parallel Platform: A combination of Multicores, GPUs, FPGAs and Many-cores accelerators (August 26<sup>th</sup>) Available online
- o Programming models, performance and power profiling for the HPP-Heterogeneous Parallel Platform (December 2<sup>nd</sup>) Available online Available online
- o Computer vision using OpenCV/OpenCL targeting the HPP- Heterogeneous Processing Platform (January 13<sup>th</sup>)

## **HPP** fully installed system





## **Computer Vision**



- What is Computer Vision?
  - "Computer vision is the transformation of data from a still or video camera into either a decision or a new representation"
  - "Learning OpenCV: Computer Vision with the OpenCV Library Paperback Oct 4 2008
     by Gary Bradski (Author), Adrian Kaehler (Author)
- Computer vision is a rapidly growing field for many reasons:
  - High quality Cameras availability and low cost,
  - Increasing processing power,
  - Mature computer vision algorithms.
- Various applications of computer vision, including:
  - objects detection, inspection and tracking, lane tracking and pedestrian detection in automotive, robotic systems, video surveillance, biometrics, augmented reality gaming, new user interfaces, and many more.

## Computer Vision challenges and opportunities



#### Challenges

- Compute-intensive, therefore require high compute capabilities.
- Many computer-vision scenarios must be executed in real-time
  - The processing of a single frame ~30-40 milliseconds.
- In the embedded space, Power consumption is a huge issue

#### **Opportunities**

- Build efficient computing architectures
  - More transistors per area,
  - Make a good use of these transistors
    - Parallelization: creating more identical processing units.
    - Specialization: building domain-specific hardware accelerators.
- Heterogeneous parallel computing
  - The concept of combining these two ideas leads to, heterogeneous computing combining many-cores CPUs, GPUs, FPGAs together with various accelerators.

## Nature is massively parallel





## Heterogeneous systems





## Computer Vision and Heterogeneous Computing



- Many high-level tasks consist of both parallel and serial subtasks:
  - Parallel subtasks: "embarrassingly parallel," because they are so easy to parallelize
    - Ex. rendering or filtering pixels.
  - Serial subtasks: do not parallelize easily, as they contain serial segments where the
    results of the later stages depend on the results of earlier stages.
    - Ex. Many iterative numerical optimization algorithms, stack-based tree-search algorithms
- Heterogeneous approach for computer vision applications
  - Run "embarrassingly parallel," tasks on the GPU or FPGA
  - Run sequential tasks on the multi-core CPU or FPGA
- Key challenges (Research direction)
  - Synchronization between dependent tasks
  - The overhead of moving the data between the different processing units

### **Heterogeneous Systems Architecture**





## **OpenCV**



- Open source library for computer vision, image processing and machine learning
- Permissible BSD license
- Freely available (www.opencv.org)
- Portability
- Real-time computer vision (x86 MMX/SSE, ARM NEON, CUDA)
- C (11 years), now C++ (3 years since v2.0), Python and Java
- Windows, OS X, Linux, Android and iOS



## **Usage**



#### Usage

- >6 million downloads, > 47,000 user group
- Google, Yahoo, Microsoft, Intel, IBM, Sony, Honda, Toyota

#### Applications

- Street view image stitching
- Automated inspection and surveillance
- Robot and driver-less car navigation and control
- Medical image analysis
- Video/image search and retrieval
- Movies 3D structure from motion
- Interactive art installations

## **Functionality**



- Image/video I/O, processing, display (core, imgproc, highgui)
- Object/feature detection (objdetect, features2d, nonfree)
- Geometry-based monocular or stereo computer vision (calib3d, stitching, videostab)
- Computational photography (photo, video, superres)
- Machine learning & clustering (ml, flann)
- CUDA and OpenCL GPU acceleration (gpu, ocl)



# HPP: GPU OPENCV + CUDA

## Installation of OpenCV on the HPP



- Create a directory for OPENCV
  - mkdir OPENCV
  - cd OPENCV/
- Getting the Cutting-edge OpenCV from the Git Repository
  - git clone https://github.com/ltseez/opencv contrib.git
  - cd opencv
- Create a temporary directory where you want to put the generated Makefiles, project files as well the object files and output binaries:
  - mkdir release
  - cd release/
- Configuration:
  - cmake -D CMAKE\_BUILD\_TYPE=RELEASE -D CMAKE\_INSTALL\_PREFIX=/usr/local ...
- Build:
  - make
  - sudo make install
- Add OpenCV CUDA libraries
  - cmake -D WITH\_CUDA=ON -D WITH\_TBB=ON -D BUILD\_NEW\_PYTHON\_SUPPORT=ON -D WITH\_V4L=ON -D
     INSTALL\_C\_EXAMPLES=ON -D INSTALL\_PYTHON\_EXAMPLES=ON -D BUILD\_EXAMPLES=ON -D WITH\_QT=ON
     -D WITH\_OPEN\_GL=ON -D WITH\_vtk=ON ..
  - sudo make
  - sudo make install

# CMC

### OpenCV GPU Module

- The GPU module is designed as a host-level API
- Set of classes and functions to achieve the best performance with GPUs
- Implemented using NVIDIA\* CUDA\* Runtime API and supports only NVIDIA GPUs
- Maintain conceptual consistency with the current CPU functionality
- The OpenCV GPU module includes:
  - Utility functions and Low-level vision primitives:
    - Provide a powerful infrastructure for developing fast vision algorithms taking advantage of GPU
  - High-level algorithms:
    - State-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others).

## **OpenCV CPU example**



```
#include<opencv2/opencv.hpp>
using namespace cv;
int main()
{
    Mat src, dst;
    src = imread("test.jpg", 0);
    bilateralFilter(src, dst, -1, 50, 7);
    Canny(dst, dst, 0, 30, 3);
    imwrite("out.png", dst);
}
```

←OpenCV header files←OpenCV C++ namespace

←Allocate a temp

←Load an image file as grayscale

←Filter the image

←Find the edges, drawn as white pixels

**←Store to an image file** 

## **OpenCV CPU example**



```
#include<opencv2/opencv.hpp>
using namespace cv;
int main()
{
    Mat src, dst;
    src = imread("test.jpg", 0);
    bilateralFilter(src, dst, -1, 50, 7);
    Canny(dst, dst, 0, 30, 3);
    imwrite("out.png", dst);
}
```



### **OpenCV CUDA example**



```
#include <opency2/opency.hpp> Copency GPU header file
#include <opencv2/gpu/gpu.hpp>
using namespace cv;
int main() {
Mat src = imread("test.jpg", 0);
if (!src.data) exit(1);
gpu::GpuMat d_src(src); ←Upload image from CPU to GPU memory
gpu::GpuMat d dst; ←Allocate a temp output image on the GPU
gpu::bilateralFilter(d src, d dst, -1, 50, 7); ←Process images on the GPU
gpu::Canny(d dst, d dst, 0, 30, 3); ←Process images on the GPU
Mat dst(d dst); ←Download image from GPU to CPU mem
imwrite("out.png", dst);
return 0:
```



### **OpenCV CUDA example**

```
#include <opencv2/opencv.hpp>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;
int main() {
Mat src = imread("car1080.jpg", 0);
if (!src.data) exit(1);
gpu::GpuMat d_src(src);
gpu::GpuMat d_dst;
gpu::bilateralFilter(d_src, d_dst, -1, 50, 7);
gpu::Canny(d_dst, d_dst, 35, 200, 3);
Mat dst(d_dst);
imwrite("out.png", dst);
return 0;
```



### **GPU-accelerated Computer Vision**

http://docs.opencv.org/2.4/modules/gpu/doc/gpu.html



- GPU Module Introduction
- Initalization and Information
- Data Structures
- Operations on Matrices
- Per-element Operations
- Image Processing
- Matrix Reductions
- Object Detection
- Feature Detection and Description
- Image Filtering
- Camera Calibration and 3D Reconstruction
- Video Analysis

## **Building and running OpenCV CUDA example**



- Source Codes are available in: opency/samples/gpu
- Building and running the hog.cpp
  - 1. Save the original CMakeLists.txt to CMakeLists original.txt
  - 2. Edit CMakeLists.txt
  - 3. Compile, link and build
  - 4. Run



[root@HPPPrototype gpu]# ./HogCode road.png --camera 0 Device 0: "Tesla K20c" 4800Mb, sm\_35, 2496 cores, Driver/Runtime ver.7.50/7.0

Controls:

ESC - exit

m - change mode GPU <-> CPU

g - convert image to gray or not

1/q - increase/decrease HOG scale

2/w - increase/decrease levels count

3/e - increase/decrease HOG group threshold

4/r - increase/decrease hit threshold

cmake\_minimum\_required(VERSION 2.8)
project(HogCode)
find\_package(OpenCV REQUIRED)
add\_executable(HogCode hog.cpp)
target link libraries(HogCode \${OpenCV LIBS})

[root@HPPPrototype gpu]# cmake.

- -- The C compiler identification is GNU 4.4.7
- -- The CXX compiler identification is GNU 4.4.7
- -- Check for working C compiler: /usr/bin/cc
- -- Check for working C compiler: /usr/bin/cc -- works
- -- Detecting C compiler ABI info
- -- Detecting C compiler ABI info done
- -- Check for working CXX compiler: /usr/bin/c++
- -- Check for working CXX compiler: /usr/bin/c++ -- works
- -- Detecting CXX compiler ABI info
- -- Detecting CXX compiler ABI info done
- -- Found CUDA: /usr/local/cuda-7.0 (found suitable exact version "7.0")
- -- Configuring done
- -- Generating done
- -- Build files have been written to: /root/OPENCV/opencv/samples/gpu [root@HPPPrototype gpu]# make

Scanning dependencies of target HogCode

[100%] Building CXX object CMakeFiles/HogCode.dir/hog.cpp.o

Linking CXX executable HogCode



## **Live Demo**

**HPP: GPU** 

OPENCV + CUDA



# HPP: FPGA OPENCL+OPENGL

## OpenCL vs. OpenGL



- Unified programming model
  - More accessible to the software developers
  - Host CPU-FPGA communication
  - C based programming language
  - Memory Hierarchy auto generated
  - Potential to integrate existing VHDL/Verilog IP
- Open Graphics Library (OpenGL)
  - Cross-language, cross-platform application programming interface (API) for rendering <u>2D</u> and <u>3D</u> vector graphics.
  - The API is typically used to interact with a graphics processing unit (GPU), to achieve hardwareaccelerated rendering.

## **OpenCL for FPGA**





White Paper: Implementing FPGA Design with the OpenCL Standard (Figure 4)

## **OpenCL Overview**





White Paper: Implementing FPGA Design with the OpenCL Standard (Page: 7)

## AOCL design flow steps



#### 1. Intermediate compilation (aoc –c [-g] <your\_kernel\_filename>.cl)

- Checks for syntatic errors
- Generates a .aoco file without building the hardware configuration file
- Generate estimated resource usage summary <your\_kernel\_filename>.log

#### 2. Emulation (aoc -g <your\_kernel\_filename>.cl)

- The AOCL Emulator generates a .aocx file that executes on x86-64 Windows or Linux host
- Assess the functionality of your OpenCL kernel

#### 3. Profiling (aoc --profile <your\_kernel\_filename>.cl)

- aocl report <your\_kernel\_filename>.aocx profile.mon
- Instruct the Altera Offline Compiler to instrument performance counters in the Verilog code in the .aocx file
- During execution, the performance counters collect performance information which you can then review in the Profiler GUI.

#### 4. Full deployment

Execute the .aocx file on the FPGA

## Sobel Filter Design Example



```
int main(int argc, char **argv)
                                                                               Initialize OpenGL
 imageFilename = (argc > 1) ? argv[1] : "butterflies.ppm";
 initGL(argc, argv);
 initCL();
                                                                               Initialize OpenCL
 input = (cl_uint*)alignedMalloc(sizeof(unsigned int) * rows * cols);
 output = (cl_uint*)alignedMalloc(sizeof(unsigned int) * rows * cols);
                                                                             Allocate Buffers on the HOST
 // Read the image
 if (!parse_ppm(imageFilename.c_str(), cols, rows, (unsigned char *)input)) {
    std::cerr << "Error: could not load " << argv[1] << std::endl;
                                                                                 Parsing the image
   teardown();
 std::cout << "Commands:" << std::endl:
 std::cout << " <space> Toggle filter on or off" << std::endl;
 How to control de app.
 std::cout << " =" << std::endl
                               " Reset filter threshold to default" << std::endl;</p>
 std::cout << " q/<enter>/<esc>" << std::endl
                                            << " Quit the program" << std::endl;
 glutMainLoop(); <
 teardown(0);
                                       GLUT event processing loop
                                       Free the resources
```

### initGL



```
void initGL(int argc, char **argv)
{
    ***
    glutWindowHandle = glutCreateWindow("Filter");
    ***
    glutKeyboardFunc(keyboard);
    glutDisplayFunc(display);
    glutReshapeFunc(reshape);
    glutIdleFunc(idle);
    glutTimerFunc(REFRESH_DELAY, timerEvent, 0);
    ***
}
```

```
void display()
{
    glClear(GL_COLOR_BUFFER_BIT);
    if (useFilter) {
        filter(output);
        glDrawPixels(cols, rows, GL_RGBA, GL_UNSIGNED_BYTE, output)
    } else {
        glDrawPixels(cols, rows, GL_RGBA, GL_UNSIGNED_BYTE, input);
    }

****
}
```

### initCL



```
void initCL()
  platform = findPlatform("Altera");
  status = clGetDevicelDs(platform, CL DEVICE TYPE ALL, 1, &device, NULL);
                                                                                            Initialize OpenCl
  clGetDeviceInfo(device, CL DEVICE NAME, sizeof(info), info, NULL);
  context = clCreateContext(0, num_devices, &device, NULL, NULL, &status);
  queue = clCreateCommandQueue(context, device, 0, &status);
                                                                                 Create Command Queue
  std::string binary file = getBoardBinaryFile("sobel", device);
  std::cout << "Using AOCX: " << binary file << "\n";
  program = createProgramFromBinary(context, binary file.c str(), &device, 1);
                                                                                                 Create Kernel
  status = clBuildProgram(program, num_devices, &device, "", NULL, NULL);
  kernel = clCreateKernel(program, "sobel", &status);
  in buffer = clCreateBuffer(context, CL MEM READ ONLY, sizeof(unsigned int) * rows * cols, NU
  out buffer = clCreateBuffer(context, CL MEM WRITE ONLY, sizeof(unsigned int) * rows * cols, N
  int pixels = cols * rows;
  status = clSetKernelArg(kernel, 0, sizeof(cl mem), &in buffer);
  status |= clSetKernelArg(kernel, 1, sizeof(cl mem), &out buffer);
                                                                                          Setup Kernel Args.
  status |= clSetKernelArg(kernel, 2, sizeof(int), &pixels);
  checkError(status, "Error: could not set sobel args");
```

### filter



```
void filter(unsigned int *output)
                                                                                                                copy data into device
  ***
  status = clEnqueueWriteBuffer(queue, in buffer, CL FALSE, 0, sizeof(unsigned int) * rows * cols, input, 0, NULL, NULL);
  ***
                                                                                                                set sobel threshold
  status = clSetKernelArg(kernel, 3, sizeof(unsigned int), &thresh),
                                                                                                                enqueue sobel filter
  status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &sobelSize, &sobelSize, 0, NULL, &event),
  status = clFinish(queue);
                                                                 Blocks until queued OpenCL commands are issued and have
                                                                 completed.
clReleaseEvent(event);
  status = clEnqueueReadBuffer(queue, out_buffer, CL_FALSE, 0, sizeof(unsigned int) * rows * cols, Output, 0, NULL, NULL);
  checkError(status, "Error: could not copy data from device");
                                                                                                               copy data from device
  status = clFinish(queue);
  checkError(status, "Error: could not successfully finish copy");
```



## **Live Demo**

HPP: FPGA
OPENCL+OPENGL



Q&A

Yassine Hariri Hariri@cmc.ca