Vivado High-Level Synthesis Tutorial (UG871)

The following tutorials explain and demonstrate all steps in the process of transforming C, C++ and SystemC code to an RTL implementation using High-Level Synthesis. These show how to create an initial RTL implementation and then transform it into both a low-area and high-throughput implementation by using optimization directives without changing the C code.

Reference for this section: Vivado Design Suite Tutorial: High-Level Synthesis (UG871)

High-Level Synthesis Introduction

This tutorial introduces Vivado High-Level Synthesis (HLS). The primary tasks for performing High-Level Synthesis using both the Graphical User Interface (GUI) and Tcl environments are demonstrated.

Lab 1: Creating a High-Level Synthesis Project

This lab explains how to set up a High-Level Synthesis (HLS) project and perform all the major steps in the HLS design flow:

  1. Creating a New Project:

    1. Open HLS GUI

    2. Enter project details (Name, Location)

    3. Add source files

    4. Add test bench files

    5. Create the solution (specify the hardware details: clock period, clock uncertainty, FPGA part)

  2. Validate the C code: The aim of this step is to confirm that the C code is correct. This process is also called C validation or C simulation.

  3. High-Level Synthesis: Synthesize the C design into an RTL design and review the synthesis report.

  4. RTL Verification: High-Level Synthesis can re-use the C test bench to verify the RTL using simulation.

  5. IP Creation: Package the design as an IP block for use with other tools in the Vivado Design Suite.

HLS Graphical User Interface (GUI)

Regions in the Graphical User Interface (GUI) and their functions are:

Explorer Pane

: Shows the project hierarchy.

Information Pane

: Shows the contents of any files opened from the Explorer pane along with report files.

Auxiliary Pane

: Cross-links with the Information pane.

Console Pane

: Shows the messages produced when Vivado HLS runs. Errors and warnings appear in Console pane tabs.

Toolbar Buttons

: Perform the most common operations. Each button also has an associated menu item in the pull-down menus. When you hold the cursor over a button, a tool tip opens that explains its function.

Perspectives

: Provide convenient ways to adjust the windows within the Vivado HLS GUI.

Synthesis Perspective

:   The default perspective allows you to synthesize designs, run
    simulations, and package the IP.

Debug Perspective

:   Includes panes associated with debugging the C code.

Analysis Perspective

:   Windows in this perspective are configured to support analysis
    of synthesis results.

Typical self-checking operation of a test bench

  • The test bench saves the output in a file.

  • That file is compared with the expected results, which are stored in another file.

  • If the output matches the expected results, the return value of the test bench main() is set to 0. Otherwise, the return value of main() is set to 1.

  • If the test bench is self-checking, then the RTL results are automatically checked during RTL verification.

  • The Vivado HLS tool can reuse the C test bench to perform verification of the RTL.

  • There is no requirement to create an RTL test bench. This provides a robust and productive verification methodology.

C synthesis report

  • If the interval is one clock cycle greater than the latency, the design is not pipelined: the next transaction can start only after the current one has completed.

  • Vivado HLS targets a clock period of Clock Target minus Clock Uncertainty.

Performance Estimates pane

: Lists the clock period, clock uncertainty, latency, and initiation interval.

Utilization Estimates pane

: Estimates the resource utilization of the design. These estimates might change after RTL synthesis, which applies additional optimizations.

Interface pane

: The Interface section shows the ports and I/O protocols created by interface synthesis. HLS automatically adds a clock and reset port along with a few interface ports.

RTL co-simulation

The default option for RTL co-simulation is to perform the simulation using the Vivado simulator and Verilog RTL.

Lab 2: Using the Tcl Command Interface

This lab exercise shows how to create a Tcl command file based on an existing Vivado HLS project and use the Tcl interface:

  1. Open the Vivado HLS Command Prompt

  2. Create a Tcl file, run_hls.tcl (for reference, see script.tcl from Lab 1)

  3. Run "vivado_hls -f run_hls.tcl" in the Vivado HLS Command Prompt (from within the required directory).

Important points to note

  • When a Vivado HLS project is created, two Tcl files (script.tcl and directives.tcl) are automatically generated in the project hierarchy.

  • The file script.tcl contains the Tcl commands to create a project with the files specified during the project setup and run all stages of the HLS flow.

  • The file directives.tcl contains any optimizations applied to the design solution.

Lab 3: Using Solutions for Design Optimization

This lab shows you how to optimize the design using optimization directives. It also creates multiple versions of the RTL implementation and compares the different solutions.

Design goals:

  • Create a version of this design with the highest throughput.

  • The final design should be able to process data supplied with an input valid signal.

  • Produce output data accompanied by an output valid signal.

  • The filter coefficients are to be stored externally to the FIR design, in a single port RAM.

  1. Creating a New Project.

  2. Optimize the I/O Interfaces: The type of I/O protocol affects which design optimizations are possible. Add directives from the Auxiliary Pane. Create multiple solutions with different optimizations.

  3. Analyze the Results: Using the Analysis perspective, performance and resource usage can be observed and analyzed.

  4. Optimize for the Highest Throughput (Lowest Interval): Unroll the rolled loops and partition the block RAM into individual registers.

  5. Compare the results of different solutions.

  • If there is an I/O protocol requirement, setting the I/O protocol should be done as early as possible in the design cycle.

  • Control states are the internal states High-Level Synthesis uses to schedule operations into clock cycles. There is a close correlation between the control states and the final states in the RTL Finite State Machine (FSM), but there is no one-to-one mapping.

  • With the insight gained through analysis, you can proceed to optimize the design.

The two issues that limit the throughput are:

  • Rolled loops

  • Use of block RAM instead of a shift-register
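Both limits can be addressed with directives placed in the source as pragmas. The following is a minimal sketch, not the tutorial's FIR source: the function name fir_sketch, the 4-tap length, and the int data type are assumptions made for illustration. A standard C++ compiler ignores the HLS pragmas, so the function can also be run in C simulation.

```cpp
#include <cassert>

const int N_TAPS = 4;  // hypothetical tap count, chosen for illustration

// Hypothetical FIR filter: the static array holds the previous input samples.
int fir_sketch(int x, const int c[N_TAPS]) {
    static int shift_reg[N_TAPS] = {0, 0, 0, 0};
// Split the array into individual registers so that every tap can be
// accessed in the same clock cycle instead of through block RAM ports.
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1

    int acc = 0;
    for (int i = N_TAPS - 1; i >= 0; i--) {
// Fully unroll the loop so the iterations can be scheduled in parallel.
#pragma HLS UNROLL
        if (i == 0) {
            shift_reg[0] = x;                 // newest sample enters at index 0
        } else {
            shift_reg[i] = shift_reg[i - 1];  // older samples move up by one
        }
        acc += shift_reg[i] * c[i];
    }
    return acc;
}
```

Feeding a single 1 followed by zeros returns the coefficients one by one, which is a quick way to check the shift-register behavior before synthesis.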

C Validation

Validation of the C algorithm is an important part of the High-Level Synthesis (HLS) process. The time spent ensuring the C algorithm is performing the correct operation and creating a C test bench, which confirms the results are correct, reduces the time spent analyzing designs that are incorrect "by design" and ensures the RTL verification can be performed automatically.

The sample design used in this tutorial is a Hamming Window FIR. There are three versions of this design:

  • Using native C data types.

  • Using ANSI C arbitrary precision data types.

  • Using C++ arbitrary precision data types.

There are no design goals for this tutorial.

Lab 1: C Validation and Debug

Reviews the aspects of a good C test bench, the basic operations for C validation and the C debugger.

  1. Creating a New Project.

  2. Review Test Bench and Run C Simulation

  3. Run the C Debugger

Good practices for writing a testbench

  • The test bench creates and stores a set of expected results that confirm the function is correct.

  • The test bench asks the Design Under Test (DUT) to generate and store results.

  • The actual and expected results are compared. If the comparison fails, the value of variable err_cnt is set to a non-zero value.

  • The test bench issues a message to the console if the comparison failed, but more importantly returns the results of the comparison.

  • This process of checking the results and returning a value of zero if they are correct automates RTL verification.
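The practices above can be sketched as follows. This is an illustrative outline only: dut_sketch is a hypothetical stand-in for the design under test, and run_test_bench holds the logic that would normally sit in the test bench main(), so that its return value (0 for pass, non-zero for fail) is what the tool sees.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical design under test (DUT): doubles its input.
int dut_sketch(int x) { return 2 * x; }

// Body of a self-checking test bench. In a real test bench this logic
// lives in main(), and the value returned here is main()'s return value.
int run_test_bench() {
    const int N = 8;
    int expected[N];
    int actual[N];
    int err_cnt = 0;

    // 1. Create and store the expected results.
    for (int i = 0; i < N; i++) expected[i] = i + i;

    // 2. Ask the DUT to generate and store its results.
    for (int i = 0; i < N; i++) actual[i] = dut_sketch(i);

    // 3. Compare actual and expected results; count mismatches in err_cnt.
    for (int i = 0; i < N; i++) {
        if (actual[i] != expected[i]) err_cnt++;
    }

    // 4. Report to the console and, more importantly, return the verdict.
    if (err_cnt == 0) {
        std::printf("Test passed\n");
        return 0;
    }
    std::printf("Test failed: %d mismatches\n", err_cnt);
    return 1;
}
```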

Lab 2: C Validation with ANSI C Arbitrary Precision Types

Uses a design with arbitrary precision C types for C Validation.

  1. Creating a New Project

  2. Run C simulation with the Launch Debugger option: The simulation does not complete, because ANSI C arbitrary precision types are not supported in debug mode.

  3. Run C simulation without the Launch Debugger option: The run is successful.

IMPORTANT: When working with arbitrary precision types you can use the Vivado HLS debug environment only with C++ or SystemC. When using arbitrary precision types with ANSI C, the debug environment cannot be used. With ANSI C, you must instead use printf or fprintf statements for debugging.

Lab 3: C Validation with C++ Arbitrary Precision Types

Uses a design with arbitrary precision C++ types for C Validation.

  1. Creating a New Project

  2. Run the C Debugger with Launch Debugger: Debugger is used to observe the arbitrary precision types being populated and utilized.

Arbitrary precision types are a powerful means to create high-performance, bit accurate hardware designs.

Interface synthesis

Interface synthesis is the process of adding RTL ports to the C design. In addition to adding the physical ports to the RTL design, interface synthesis includes an associated I/O protocol, allowing the data transfer through the port to be synchronized automatically and optimally with the internal logic.

Lab 1: Block-Level I/O Protocols

Review the function return and block-level protocols.

  1. Creating a New Project

  2. Create and Review the Default Block-Level I/O Protocol

  3. Modify the Block-Level I/O protocol

Important points to be noted

  • A clock and reset (single-bit inputs) have been added to the design by HLS.

  • A block-level I/O protocol has been added to control the RTL design.

Block level I/O Protocol

The block-level I/O protocol allows the RTL design to be controlled by additional ports independently of the data I/O ports. This I/O protocol is associated with the function itself and not with any of the data ports. The default block-level I/O protocol is called ap_ctrl_hs (the Control Handshake protocol).

Behavior of the signals for block-level I/O protocol ap_ctrl_hs:

  • The design does not start operation until ap_start is set to logic 1.

  • The design indicates it is no longer idle by setting port ap_idle low.

  • Output signal ap_ready goes high to indicate the design is ready for new inputs on the next clock.

  • Output signal ap_done indicates when the design is finished and that the value on output port ap_return is valid.

  • If ap_start is held high, the next transaction will start on the next clock cycle.

  • In addition, the RTL cosimulation feature requires a block-level I/O protocol to sequence the test bench and RTL design for cosimulation automatically.

  • For designs that use ap_ctrl_none, cosimulation supports only the following:

    1. Combinational designs

    2. Pipelined design with task interval of 1

    3. Designs with array streaming or hls_stream ports

Types of the block-level interface protocol

There are four types of the block-level interface protocol:

  1. ap_ctrl_none: No block-level I/O control protocol. When the interface protocol ap_ctrl_none is used, no block-level I/O protocols are added to the design. The only ports are those for the clock, reset and the data ports.

  2. ap_ctrl_hs: The block-level I/O control handshake protocol. This protocol is associated with the function return value (this is true even if the function has no return value specified in the code).

  3. ap_ctrl_chain: The block-level I/O protocol for control chaining. This I/O protocol is primarily used for chaining pipelined blocks together. It is similar to ap_ctrl_hs but has an additional input signal, ap_continue, which must be high when ap_done is asserted for the next transaction to proceed. This allows downstream blocks to apply back-pressure on the system and halt further processing when they are unable to continue accepting new data.

  4. s_axilite: Can be applied in addition to ap_ctrl_hs or ap_ctrl_chain to implement the block-level I/O protocol as an AXI Slave Lite interface in place of separate discrete I/O ports.
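In source code, the block-level protocol is selected with an INTERFACE pragma applied to port=return. The sketch below is a minimal illustration; the function name adders_sketch and its arguments are hypothetical, and the pragma has no effect when the file is compiled with a standard C++ compiler.

```cpp
#include <cassert>

// Hypothetical top-level function. The block-level protocol is attached
// to the function return (port=return).
int adders_sketch(int in1, int in2, int in3) {
// Replace the default ap_ctrl_hs handshake with no block-level protocol.
// Substituting ap_ctrl_hs or ap_ctrl_chain here selects those protocols.
#pragma HLS INTERFACE ap_ctrl_none port=return
    return in1 + in2 + in3;
}
```

Synthesizing the same function once per protocol, each in its own solution, is a convenient way to compare the ports listed in the Interface section of the report.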

Lab 2: Port I/O Protocols

Understand the default I/O protocol for ports and learn how to select an I/O protocol.

  1. Creating a New Project

  2. Specify the I/O Protocol for Ports

Important points to be noted

  • The code does not have a function return, but instead passes the output of the function through the pointer argument.

  • The types of I/O protocol that interface synthesis can add to C function arguments depend on the argument type.

  • Pass-by-value arguments can be implemented with the following I/O protocols:

    1. ap_none: No I/O protocol. This is the default for inputs.

    2. ap_stable: No I/O protocol.

    3. ap_ack: Implemented with an associated output acknowledge port.

    4. ap_vld: Implemented with an associated input valid port.

    5. ap_hs: Implemented with both input valid and output acknowledge ports.

  • Pass-by-reference arguments can be implemented with the following I/O protocols:

    1. ap_none: No I/O protocol. This is the default for inputs.

    2. ap_stable: No I/O protocol.

    3. ap_ack: Implemented with an associated input acknowledge port.

    4. ap_vld: Implemented with an associated output valid port. This is the default for outputs.

    5. ap_ovld: Implemented with an associated output valid port (no valid port for the input part of any inout ports).

    6. ap_hs: Implemented with both input valid and output acknowledge ports.

    7. ap_fifo: A FIFO interface with associated output write and input FIFO full ports.

    8. ap_bus: A Vivado HLS bus interface protocol.
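Port-level protocols are chosen per argument with the same INTERFACE pragma. The following sketch assumes a hypothetical function adders_io_sketch with two pass-by-value inputs and one pointer argument that is both read and written; the protocol choices are examples, not recommendations.

```cpp
#include <cassert>

// Hypothetical function: the result is passed back through the pointer
// argument instead of a function return value.
void adders_io_sketch(int in1, int in2, int *in_out1) {
// in1: data port with an associated valid port.
#pragma HLS INTERFACE ap_vld port=in1
// in2: data port with an associated acknowledge port.
#pragma HLS INTERFACE ap_ack port=in2
// in_out1: two-way handshake (valid and acknowledge).
#pragma HLS INTERFACE ap_hs port=in_out1
    *in_out1 = in1 + in2 + *in_out1;
}
```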

Lab 3: Implementing Arrays as RTL Interfaces

This lab reviews how array ports are implemented and can be partitioned.

  1. Creating a New Project (This design has an input array and an output array.)

  2. Synthesize Array Function Arguments to RAM Ports

  3. Using Dual-Port RAM and FIFO Interfaces

  4. Partitioned RAM and FIFO Array interfaces

  5. Fully Partitioned Array Interfaces

Important points to be noted

  • Array arguments in the C source are by default synthesized into RTL RAM ports.

  • High-Level Synthesis allows you to specify a RAM interface as a single-port or dual-port. If not specified, Vivado HLS automatically analyzes the design and selects the number of ports to maximize the data rate.

  • An array argument is implemented using multiple RTL ports only when the loops are unrolled or when the design is pipelined.

  • By using a dual-port RAM interface, input data can be accepted at twice the rate of a single-port RAM interface.
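As a rough illustration of the dual-port case, the sketch below reads two input values per loop iteration. The function name array_io_sketch, the array length, and the arithmetic are assumptions; the RESOURCE pragma with core RAM_2P_BRAM is the Vivado HLS way of requesting a dual-port RAM for the array.

```cpp
#include <cassert>

const int N = 8;  // hypothetical input array length

// Hypothetical function with one input array and one output array.
// Each output needs two input values, so two reads occur per iteration.
void array_io_sketch(const int d_i[N], int d_o[N / 2]) {
// Request a dual-port RAM interface for the input array.
#pragma HLS RESOURCE variable=d_i core=RAM_2P_BRAM
    for (int i = 0; i < N / 2; i++) {
// Pipelining gives the tool a reason to use both ports in the same cycle.
#pragma HLS PIPELINE
        d_o[i] = d_i[2 * i] + d_i[2 * i + 1];
    }
}
```

With a single-port interface the two reads would have to happen in different clock cycles, which is the rate difference described in the last bullet.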

Lab 4: Implementing AXI4 Interfaces

Create an optimized implementation of the design and add AXI4 interfaces.

  1. Creating a New Project (This design has an input array and an output array.)

  2. Create an Optimized Design with AXI4-Stream Interfaces: Varying the partitioning and loop unrolling allowed the optimal balance of area and performance.

  3. Implement an AXI4-Lite Interface

When an AXI4-Lite interface is added, the IP packaging process creates software driver files that enable an external block, typically a CPU, to control this block (start, stop, set port values, review the interrupt status).
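A minimal sketch of how both AXI4 interface types might be requested in the source follows. The function axi_sketch, the gain argument, and the bundle name ctrl are hypothetical; only the INTERFACE pragma modes axis and s_axilite come from the tool.

```cpp
#include <cassert>

const int LEN = 4;  // hypothetical number of samples

// Hypothetical function: scales each input sample by a gain value.
void axi_sketch(const int d_i[LEN], int d_o[LEN], int gain) {
// Implement the array arguments as AXI4-Stream ports.
#pragma HLS INTERFACE axis port=d_i
#pragma HLS INTERFACE axis port=d_o
// Group the scalar argument and the block-level control signals
// (port=return) into a single AXI4-Lite interface.
#pragma HLS INTERFACE s_axilite port=gain bundle=ctrl
#pragma HLS INTERFACE s_axilite port=return bundle=ctrl
    for (int i = 0; i < LEN; i++) {
        d_o[i] = gain * d_i[i];
    }
}
```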

Arbitrary Precision Types

The native C/C++ data types are fixed to 8-bit boundaries:

  • char (8-bit)

  • short (16-bit)

  • int (32-bit)

  • long long (64-bit)

  • float (32-bit)

  • double (64-bit)

  • Exact width integer types such as int16_t (16-bit) and int32_t (32-bit)

Using standard C data types for hardware design results in unnecessary hardware costs. Operations can use more LUTs and registers than needed for the required accuracy, and delays might even exceed the clock cycle, requiring more cycles to compute the result.

Vivado High-Level Synthesis (HLS) provides a number of bit accurate or arbitrary precision data-types, allowing you to model variables using any (arbitrary) width.
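To make "bit accurate" concrete, the sketch below emulates in plain C++ what it means to hold a value in a W-bit signed variable. It does not use the Vivado HLS arbitrary precision headers; wrap_signed is a hypothetical helper, and the wrap-on-overflow behavior shown is an assumption about how a W-bit two's complement value behaves.

```cpp
#include <cassert>
#include <cstdint>

// Emulates storing v in a W-bit signed integer (two's complement,
// wrapping on overflow). Plain C++ only, for illustration.
template <int W>
int64_t wrap_signed(int64_t v) {
    static_assert(W >= 1 && W <= 62, "width must be between 1 and 62 bits");
    const uint64_t mask = (uint64_t(1) << W) - 1;
    const uint64_t bits = uint64_t(v) & mask;      // keep the low W bits
    const uint64_t sign = uint64_t(1) << (W - 1);  // position of the sign bit
    // If the sign bit is set, the stored value is negative.
    if (bits & sign) return int64_t(bits) - (int64_t(1) << W);
    return int64_t(bits);
}
```

For example, a 12-bit signed variable covers -2048 to 2047, so wrap_signed<12>(2047) stays 2047 while wrap_signed<12>(2048) wraps to -2048. A design whose data fits in 12 bits therefore needs 12-bit operators rather than the 16-bit or 32-bit operators implied by short or int.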

Lab 1: Arbitrary Precision

This lab synthesizes a design using standard C++ floating-point types and reviews the results.

  1. Creating a New Project

  2. Review Test Bench and Run C Simulation

  3. Synthesize the Design and Review Results

High-Level Synthesis can synthesize floating-point types directly into hardware, provided the operations are standard arithmetic operations (+, -, *, /).

Lab 2: Arbitrary Precision

This lab synthesizes the same function used in Lab 1 using arbitrary precision fixed-point types, highlighting how the same design can be converted to the Vivado HLS ap_fixed types, retaining the required accuracy but creating a more optimal hardware implementation.

  1. Creating a New Project

  2. Review Test Bench and Run C Simulation: The test bench checks the accuracy of the results by comparing standard C++ floating-point types with HLS Arbitrary Precision types. The results are within a specified range of accuracy.

  3. Synthesize the Design and Review Results: Through the use of arbitrary precision types, both the latency and the area are reduced (by 50% and 80% respectively), and the operations in the RTL hardware are no larger than necessary.

When changing data types from standard C types to arbitrary precision types, make sure to reduce the size of the data types. This results in smaller operators, reduced area, and fewer clock cycles to complete.

Design Analysis

The general design methodology for creating an RTL implementation from C, C++, or SystemC includes the following tasks:

  • Synthesizing the design.

  • Reviewing the results of the initial implementation.

  • Applying optimization directives to improve performance.

These steps can be repeated until the required performance is achieved. Subsequently, the design can be revisited to improve area.

Lab 1: Design Optimization

This lab uses the insights from a DCT design analysis to apply optimizations and judge their effectiveness.

  1. Creating a New Project

  2. Review the Source Code and Create the Initial Design

  3. Review the Performance Using the Synthesis Report

  4. Review the Performance Using the Analysis Perspective

  5. Apply Loop Pipelining and Review for Loop Optimization

  6. Apply Loop Optimization and Review for Bottlenecks

  7. Partition Block RAMs and Analyze Concurrency

  8. Partition Block RAMs and Apply Dataflow optimization

  9. Optimize the Hierarchy for Dataflow

Important points to be noted

  • High-level synthesis might automatically inline small functions to improve the quality of results (QoR). This can be prevented by adding the Inline directive with the -off option to any function being automatically inlined.

  • The Analysis perspective consists of five panes:

    • Module Hierarchy Pane: Navigate through the hierarchy.

    • Performance Profile Pane: Shows the performance details for a particular level of hierarchy. Also shows how the operations in a particular block are scheduled into clock cycles.

    • Resource Profile Pane: Shows the resource details for a particular level of hierarchy.

    • Console pane: Comprises the Properties, Outline, Warning, and C Source tabs, among others.

    • Information pane: Shows the contents of any files opened along with report files.

  • To improve the initiation interval further from this state, 2 methods can be utilized:

    • Pipeline the loops

    • Pipeline the entire function

  • Pipelining the function unrolls all the loops within it, and thus greatly increases the area. If the objective is to get the highest possible performance with no regard for area, this may be the best optimization to perform.

  • Pipelining loops transforms the latency from Latency = iteration latency * (tripcount * interval) to Latency = iteration latency + (tripcount * interval)

  • Bottlenecks are limitations in the flow of data that can prevent the logic blocks from working at their maximum data rate. Such limitations in the data flow can come from a number of sources.

  • Another source of bottlenecks is data dependencies in the original source code. In some cases, these data dependencies are inherent in how the algorithm operates, as when a calculation cannot be performed until an earlier calculation has completed. Sometimes, however, the use of an optimization directive or a minor change to the C code can remove them.

  • Approaches for identifying issues in the RTL design:

    • Start with the largest latency or interval in the Module Hierarchy report and navigate down the hierarchy to find the source of any large latency or interval.

    • Click the Resource Profile to examine I/O and memory usage.

    • Using the graphical viewer, look for patterns in the Performance view that indicate a limitation in data flow.

  • The Resource view in analysis perspective shows how the resources in the design are used in different control states.
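The two loop-latency expressions quoted in the list above can be compared numerically. The sketch below simply evaluates them as written in the text; the function names and the sample numbers are hypothetical, and the expressions are approximations rather than exact cycle counts reported by the tool.

```cpp
#include <cassert>

// Loop latency before pipelining, as written in the text:
// iteration latency * (tripcount * interval).
long rolled_latency(long iteration_latency, long tripcount, long interval) {
    assert(iteration_latency >= 0 && tripcount >= 0 && interval >= 0);
    return iteration_latency * (tripcount * interval);
}

// Loop latency after pipelining, as written in the text:
// iteration latency + (tripcount * interval).
long pipelined_latency(long iteration_latency, long tripcount, long interval) {
    assert(iteration_latency >= 0 && tripcount >= 0 && interval >= 0);
    return iteration_latency + (tripcount * interval);
}
```

For an assumed loop with an iteration latency of 3 cycles, a trip count of 8, and an interval of 1, the first expression gives 24 cycles and the second gives 11, which shows why pipelining matters most for loops with a high trip count.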

Design Optimization

A crucial part of creating high quality RTL designs using High-Level Synthesis is having the ability to apply optimizations to the C code. High-Level Synthesis always tries to minimize the latency of loops and functions. To achieve this, within the loops and functions, it tries to execute as many operations as possible in parallel. At the level of functions, High-Level Synthesis always tries to execute functions in parallel.

In addition to these automatic optimizations, directives are used to:

  • Execute multiple tasks in parallel, for example, multiple executions of the same function or multiple iterations of the same loop. This is pipelining.

  • Restructure the physical implementation of arrays (block RAMs), functions, loops and ports to improve the availability of data and help data flow through the design faster.

  • Provide information on data dependencies, or lack of them, allowing more optimizations to be performed.

The final optimization technique is to modify the C source code to remove unintended dependencies in the code that may limit the performance of the hardware.

Lab 1: Optimizing a Matrix Multiplier

Aim

To contrast the use of loop pipelining and function pipelining when creating a design that can process one sample per clock, and to analyze loop dependencies and data flow limitations (bottlenecks).

A matrix multiplier design is used to show how to fully optimize a design that is heavily based on loops. The design goal is to read one sample per clock cycle using a FIFO interface, while minimizing the area.

  1. Create a project

  2. Synthesize and Analyze the Design: By default, the code is implemented without pipelining to set the benchmark.

  3. Pipeline the Inner Loop: It fails to achieve the goal because of a carried dependency (described below).

  4. Pipeline the Outer Loop: It fails to achieve the goal as 2-cycle BRAM read operations overlap.

  5. Reshape the Arrays: Arrays are reshaped using Array Reshape Directive to achieve 1 clock cycle II.

  6. Apply FIFO Interfaces: It is impossible to use a FIFO interface for data access with the code as written. To use a FIFO interface, the optimization directives available in Vivado HLS are inadequate as the code currently enforces a certain order of reads and writes.

  7. Pipeline the Function: The design latency decreases further; however, the area and resource usage increase substantially because all the loops are unrolled.

Important points to be noted:

  • To improve the initiation interval further from this state, 2 methods can be utilized:

    • Pipeline the loops

    • Pipeline the entire function

  • When pipelining nested loops, pipelining the innermost loop might have the greatest effect, as it is executed the greatest number of times.

  • Carried dependency: This is a dependency between an operation in one iteration of a loop and an operation in a different iteration of the same loop.

  • High-Level Synthesis automatically applies loop flattening, collapsing the nested loops, removing the loop transitions (essentially creating a single loop with more iterations but overall fewer clock cycles).

  • Arrays are implemented as block RAMs and arrays which are arguments to the function are implemented as block RAM ports. As a block RAM can only have a maximum of two ports (for dual-port block RAM), reading more than 2 values in one clock cycle is not possible.

  • High-Level Synthesis allows arrays to be partitioned, mapped together and re-shaped.

  • High-Level Synthesis can only report one schedule error or warning at a time.

  • The default behavior of High-Level Synthesis is to produce a design with the highest performance.

  • Pipelining loops allows the loops to remain rolled, thus providing a good means of controlling the area.

  • The pipelined function results in the best performance.

  • There is a trade off between Performance and Area.
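The combination of loop pipelining and array reshaping described in the steps above might look like the following sketch. It is not the lab's source file: the name matrixmul_sketch, the 3x3 size, and the int data type are assumptions, and the result it produces in hardware depends on the tool version and target device.

```cpp
#include <cassert>

const int DIM = 3;  // hypothetical matrix size

// Hypothetical matrix multiplier: res = a * b.
void matrixmul_sketch(const int a[DIM][DIM], const int b[DIM][DIM],
                      int res[DIM][DIM]) {
// Reshape so that a whole row of a and a whole column of b can each be
// read in a single access, working around the two-port limit of block RAM.
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
// Pipelining this loop unrolls the product loop nested inside it.
#pragma HLS PIPELINE II=1
            res[i][j] = 0;
            for (int k = 0; k < DIM; k++) {
                res[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
```

Moving the PIPELINE pragma to the innermost loop, or to the top of the function body, gives the other two variants compared in this lab.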

Lab 2: C Code Optimized for I/O Accesses

This lab shows how modifications to the code can help overcome some performance limitations inherent, but unintended, in the code.

In lab 1, the nature of the C code, which specified multiple accesses to the same addresses, prevented streaming interfaces being applied.

  • In a streaming interface, the values must be accessed in sequential order.

  • HLS cannot decide to change the specification of the algorithm.

  1. Create a project

  2. Review the code:

    • The directives are specified in the code as pragmas.

    • For-loops have been added to cache the row and column reads.

    • A temporary variable is used for the accumulation and result is written only when the final result is computed for each value.

    • Cache for-loops are automatically unrolled.

  3. Synthesize the design and verify the RTL using co-simulation: Successful synthesis with reading one sample every clock cycle using streaming FIFO interfaces.
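The code changes listed in step 2 can be sketched as below. This is a reconstruction under assumptions, not the lab's file: the name matrixmul_cached_sketch, the 3x3 size, and the int data type are hypothetical. The point it illustrates is that every element of a and b is read exactly once and every element of res is written exactly once, which is what a streaming interface requires.

```cpp
#include <cassert>

const int DIM = 3;  // hypothetical matrix size

// Hypothetical matrix multiplier restructured for streaming I/O.
void matrixmul_cached_sketch(const int a[DIM][DIM], const int b[DIM][DIM],
                             int res[DIM][DIM]) {
#pragma HLS ARRAY_RESHAPE variable=a complete dim=2
#pragma HLS ARRAY_RESHAPE variable=b complete dim=1
#pragma HLS INTERFACE ap_fifo port=a
#pragma HLS INTERFACE ap_fifo port=b
#pragma HLS INTERFACE ap_fifo port=res
    int a_row[DIM];        // cache of the current row of a
    int b_copy[DIM][DIM];  // cache of all of b

    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
#pragma HLS PIPELINE
            // Read each row of a only once, when a new row starts.
            if (j == 0) {
                for (int k = 0; k < DIM; k++) a_row[k] = a[i][k];
            }
            // Read each column of b only once, during the first row.
            if (i == 0) {
                for (int k = 0; k < DIM; k++) b_copy[k][j] = b[k][j];
            }
            // Accumulate in a temporary and write the result once.
            int tmp = 0;
            for (int k = 0; k < DIM; k++) tmp += a_row[k] * b_copy[k][j];
            res[i][j] = tmp;
        }
    }
}
```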

RTL Verification

The High Level Synthesis tool automates the process of RTL verification and allows you to use RTL verification to generate trace files that show the activity of the waveforms in the RTL design.

RTL verification is often called cosimulation or C/RTL cosimulation, as both C and RTL are used in the verification.

Lab 1: RTL Verification and the C Test Bench

This lab walks through the RTL verification steps and demonstrates the importance of the C test bench in verifying the RTL.

  1. Create a project

  2. Perform RTL Verification

  3. Modify the C test bench

Lab 2: Viewing Trace Files in Vivado

This lab creates RTL trace files and analyzes them using the Vivado Design Suite. The steps involved in the lab are:

  1. Create an RTL Trace File using Vivado Simulator.

  2. Perform RTL Verification

  3. Modify the C test bench

RTL verification comprises three phases:

  1. WrapC Simulation: The C test bench is executed to generate input stimuli for the RTL design.

  2. RTL Simulation: An RTL test bench with newly generated input stimuli is created and the RTL simulation is then performed.

  3. Post-Checking Simulation: Finally, the output from the RTL simulation is re-applied to the C test bench to check the results.

RTL verification issues message SIM-1000 if the verification passes.

Lab 3: Viewing Trace Files in ModelSim

This lab creates RTL trace files and analyzes them using a third-party RTL simulator.

Using HLS IP in IP Integrator

The RTL produced by High-Level Synthesis can be packaged as IP and used inside IP Integrator.

Lab 1: Integrate HLS IP with a Xilinx IP Block

This lab completes the steps to generate two HLS blocks for the IP catalog and use them in a design with Xilinx IP, an FFT. The objective is to validate and verify the final design using an RTL test bench.

  1. Create Vivado HLS IP Blocks

  2. Create a Vivado Design Suite Project

  3. Add HLS IP to an IP Repository

  4. Create a Block Design for RealFFT

  5. Verify the Design

Using HLS IP in a Zynq AP SoC Design

A common use of High-Level Synthesis design is to create an accelerator for a CPU -- to move code that executes on the CPU into the FPGA programmable logic to improve performance.

Lab 1: Implement Vivado HLS IP on a Zynq Device

This lab integrates both the High-Level Synthesis IP and the software drivers created by HLS to control the IP in a design implemented on a Zynq device.

  1. Create a Vivado HLS IP Block

  2. Create a Vivado Zynq Project

  3. Add HLS IP to the IP Catalog

  4. Creating an IP Integrator Block Design of the System

  5. Implementing the System

  6. Developing Software and Running it on the Zynq System

  7. Modify software to communicate with HLS block

Lab 2: Streaming Data Between the Zynq CPU and HLS Accelerator Blocks

This lab illustrates a common high-performance scheme for connecting hardware accelerator blocks that, in a streaming manner, consume data originating in the CPU memory and/or produce data destined for it.
