For today's post, we'll delve into development tools.
The primary development tool used in our experiments has been the aoc compiler. This compiler is the cornerstone of our SDK, allowing us to generate an FPGA design from an OpenCL kernel. It also generates a significant number of files containing information about our design, which will be invaluable in the optimization process.
The aoc compiler generates an FPGA configuration, called a "design" and written in a hardware description language (HDL), from the OpenCL kernel and the pragmas added to guide it. This phase is a substantial amount of work and can take several hours: the compiler has to map every kernel instruction onto the hardware resources (logic gates, memory blocks, digital signal processors, etc.).
The aoc compiler essentially generates a .aocx binary from a .cl file. Useful flags can be passed to the compiler to analyze the generated design and potentially optimize it later. Our typical compilation command, along with the effect of the compilation flags, is described below:
>: aoc -v --board s5phq_d8 --report --profile kernel.cl -o design.aocx
Flag | Effect |
-v | verbose output (reports the success or failure of the various compilation phases) |
--board | specifies the targeted architecture (here a Stratix V board, s5phq_d8) |
--report | generates a file listing the optimizations applied to the kernel and provides an estimate of hardware usage during compilation |
--profile | instruments the design to enable timing and profiling |
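Since a compilation can run for hours, it can help to script the command rather than retype it. The following is only a minimal sketch of such a wrapper in Python: the function name `build_aoc_command` and its parameters are our own illustration, not part of the SDK, and it uses nothing beyond the four flags described above.

```python
import shlex


def build_aoc_command(kernel, output, board, verbose=True, report=True, profile=True):
    """Assemble the aoc argument list from the flags described in the table above.

    The function itself is hypothetical; only the flags come from the post.
    """
    cmd = ["aoc"]
    if verbose:
        cmd.append("-v")
    cmd += ["--board", board]
    if report:
        cmd.append("--report")
    if profile:
        cmd.append("--profile")
    cmd += [kernel, "-o", output]
    return cmd


if __name__ == "__main__":
    cmd = build_aoc_command("kernel.cl", "design.aocx", "s5phq_d8")
    # Print the command so it can be checked before a long compilation starts.
    print(shlex.join(cmd))
    # To actually launch the compilation (requires the SDK to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning a list instead of a single string keeps the arguments unambiguous if a path contains spaces, and lets the same helper build a variant without `--profile` for a final, non-instrumented design.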
At the end of compilation, a folder containing numerous files describing our design is produced. All of these files are intended to be read by the Quartus II tool, which we had access to. The tool seems extremely rich, and we did not have time to fully explore its capabilities. However, the files that seemed most interesting to us are plain text and can be opened in any text editor: a compilation log and an area report showing how much of the targeted FPGA board's hardware our design uses. These files will be useful in the optimization process, and a preview of their "raw" content is available below.
We do note, however, that the compilation log (shown in the bottom right of the previous image) is less comprehensive than the one presented in the Altera documentation for single work-item kernels. The log shown there gives information on the different parts of the kernel, indicating which loops were successfully unrolled and which were not, and how efficient the memory transactions are.
Here, we find the terminal output, with a compilation log that does not seem to contain interesting information, leaving us wanting more. A more comprehensive log would undoubtedly be more useful to guide us in the profiling and optimization phases!
When the design has been compiled with the --profile flag, a profile.mon file is generated automatically when the FPGA application runs. This file contains information about the kernel execution, which can be visualized through a graphical interface invoked by aocl, as follows:
>: aocl --report /path_to_design/file.aocx profile.mon
The tabs present in the interface include the kernel execution timeline and all lines of code from the kernel, with several attached metrics.
In the Kernel execution tab, the presented timeline is rather simplistic compared to NVIDIA's tools for GPUs. It provides the kernel execution time and can also show read and write times to global memory if the ACL_PROFILE_TIMER environment variable was set to 1 before running the application.
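Because the variable has to be set before the host application starts, one option is to launch the application from a small script that sets it for that process only. This is a sketch under assumptions: `./host_app` is a placeholder for your own host executable, and the helper names are ours. Only the variable name ACL_PROFILE_TIMER and its value come from the text above.

```python
import os
import subprocess


def profiling_env(base_env=None):
    """Return a copy of the environment with ACL_PROFILE_TIMER set to "1".

    The caller's environment is copied, not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ACL_PROFILE_TIMER"] = "1"
    return env


def run_host(host_cmd):
    """Run the host application (hypothetical command) with the timer enabled."""
    return subprocess.run(host_cmd, env=profiling_env(), check=True)


if __name__ == "__main__":
    # "./host_app" is a placeholder; replace it with your own host binary.
    # run_host(["./host_app"])
    print(profiling_env({"PATH": "/usr/bin"}))
```

Setting the variable per process avoids leaving it exported in the shell, where it would silently affect later runs that are not meant to be profiled.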
However, a significant amount of information is available in the source code tab, containing all kernel code lines, along with performance metrics attached. This allows visualization of instructions associated with design bottlenecks (pipeline stalling, inefficient memory reads/writes, etc.).
Column | Meaning |
Attribute | Information about the memory area targeted by the operation (global, local, read/write, DDR/QDR, etc.) |
Stall % | Percentage of time during which the instruction slows down/blocks the pipeline |
Occupancy % | Percentage of time during which the instruction is executed by at least one work item |
Bandwidth | Bandwidth used by the read/write operation, and its efficiency |
In particular, the stall metric should help decide whether a loop is worth unrolling or whether a given operation should be optimized or reformulated, while the bandwidth metric can reveal inefficient or non-vectorized memory transactions.
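To make this reading of the metrics concrete, here is a small triage sketch. Everything in it is an assumption made for illustration: the profiler GUI does not export data in this form as far as we know, so the rows are typed in by hand, the sample values are invented, and the thresholds (20% stall, 50% bandwidth efficiency) are arbitrary starting points rather than recommended values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineMetrics:
    line: int                                   # kernel source line number
    stall_pct: float                            # "Stall %" column
    occupancy_pct: float                        # "Occupancy %" column
    bandwidth_eff_pct: Optional[float] = None   # None for non-memory operations


def triage(rows, stall_limit=20.0, eff_limit=50.0):
    """Flag lines whose metrics suggest a bottleneck (thresholds are arbitrary)."""
    findings = []
    for r in rows:
        if r.stall_pct >= stall_limit:
            # Candidate for loop unrolling or for reformulating the operation.
            findings.append((r.line, "stall"))
        if r.bandwidth_eff_pct is not None and r.bandwidth_eff_pct < eff_limit:
            # Candidate for a vectorized or better-aligned memory access.
            findings.append((r.line, "bandwidth"))
    return findings


if __name__ == "__main__":
    # Invented values, for illustration only.
    rows = [
        LineMetrics(line=12, stall_pct=5.0, occupancy_pct=90.0, bandwidth_eff_pct=95.0),
        LineMetrics(line=20, stall_pct=45.0, occupancy_pct=60.0),
        LineMetrics(line=31, stall_pct=10.0, occupancy_pct=80.0, bandwidth_eff_pct=25.0),
    ]
    for line, kind in triage(rows):
        print(f"line {line}: {kind}")
```

Occupancy is carried along but not used for flagging here; a low occupancy on a line with a low stall value usually just means the instruction is rarely reached, which is not a bottleneck in itself.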
The aoc compiler is the cornerstone of this SDK and produces a significant amount of information accompanying the FPGA design generated from the OpenCL kernel. This information (board usage, compilation log, etc.) can be visualized with the Quartus II tool, but as newcomers, we were overwhelmed by the amount of information and did not have time to explore its real capabilities. Additionally, our subsequent attempts did not yield a compilation log as detailed as the one presented in the Altera documentation, which would have been very useful.
The optimization process is inseparable from compilation: it is the compiler that instruments the design, which in turn provides instruction-level metrics and a kernel execution timeline, both viewable through a GUI invoked by aocl.
We will continue in the next post with a presentation of the AES 128-bit encryption algorithm, which will be our practical case guiding the rest of the study.