Implementation Detail: The end-to-end equivalent checking framework of HLS DEEQ is implemented in Python and is tested on a set of HLS benchmarks. The Vivado HLS tool [10] is used to generate Verilog RTL for the benchmarks written in C. The benchmarks are taken from [107]. We have then used the pyVerilog [9] parser to extract the abstract syntax tree (AST) from the Verilog and then implemented the rewriting method to obtain the RTL-C. Specifically, we have adapted FastSim [15] to generate the RTL-C from Verilog.

The RTL-C generation time is less than 5ms for all cases. All traces of both the behaviours ( C and RTL-C) are obtained by running Klee [7]. Klee also gives a test case corresponding to each trace. We forced Klee to get expressions for the output variables in smtlib format.

The output equivalence of two traces is formally verified using the SMT solver Z3 [54]. We have also used Z3 to identify compatible traces for merging. The experiments have been performed on a machine with a CPU: Intel Core i7, 2.5GHz, and 8GB RAM.

Experiments: The experiment results of our benchmarks are shown in Table 4.1. The
2^{nd} (#in) and 3^{rd} (#out) show the number of inputs and outputs for each benchmark,
respectively. We have recorded the number of code lines (#line) and variables (#var) of
the input C, RTL, and RTL-C in the next six columns. The number of lines in the RTL-C

DEEQ: Data-driven End-to-End EQuivalence Checking of High-level Synthesis

Table 4.1: Experimental results for different high-level synthesis benchmarks

Bench marks

#in #out C code RTL code RTL-C Traces Equivalent Not Equivalent

#line #var #line #regs #line #var #C #

RTLC#merged time(s) result time(s) result (1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12) (13) (14) (15) (16) Waka 20 3 33 21 270 12 474 126 3 4 (3, 3) 1.709 Eq 0.669s NEq Arf 11 4 53 43 351 19 607 158 4 4 (3, 3) 1.890 Eq 0.949 NEq Parker 6 1 51 14 188 2 330 100 12 23 (2, 2) 1.614 Eq 0.976 NEq

Find

Min8 8 1 40 15 175 11 780 243 128 128 (8, 8) 22.246 Eq 17.141 NEq Matrix

Add 2 1 48 7 734 44 2595 241 1 1 (1, 1) 1.684 Eq 0.749 NEq Sum

Array 1 1 19 4 263 15 541 100 1 1 (1, 1) 0.754 Eq 0.706 NEq Motion 10 3 52 43 415 29 780 235 1 1 (1, 1) 0.681 Eq 0.663 NEq Dfadd 2 1 554 70 1975 113 2132 650 67 68 (21,42) 1016.05 Eq 960.23 NEq

code is high as compared to the RTL since each register is copied in a temporary variable
at the start of each state to maintain the concurrent execution of operations of hardware
in C [15]. The 10^{th}, 11^{th} and 12^{th} columns are the number of traces of the input C (#C),
RTL-C code (#RTLC) and merged traces (#merged), respectively. Equivalence checking
results in terms of the total time spent are given in the 13^{th}. As shown, our tool successfully
established the equivalence (Eq) in all cases.

In our further experiment, we have created some non-equivalent scenarios from the RTL-C by swapping if-else conditions, changing operation type, or adding/deleting some operations. For example, one of the operation tmp6 reg 362 = (in7 + add5 reg 305 temp) taken from motion is intentionally replaced by tmp6 reg 362 = (in7 * add5 reg 305 temp);

clearly these two operations are not equivalent. In this scenario, our equivalence checking
method reported non-equivalent. In all cases, these errors are correctly identified by our
proposed method as shown in the 15^{th} (time) and 16^{th} (result) columns. Our tool reports
non-equivalence (NEq) in all cases since Z3 returns SAT with a witness of violation. Our
framework can prove the equivalence of all eight benchmarks successfully within two sec-
onds except for findMin8 and Dfadd. For findMin8, merging of compatible traces is taking
almost 15 seconds. For Dfadd, the number of traces after merging is not equal. As a result,

Experiment Results and Analysis

equivalence checking takes more time.

Result Analysis: There are several interesting scenarios that arise during our experiments:

(i) For FindMin8, there are many compatible traces found since one of the 8 inputs will be the minimum. As a result, 128 traces are merged into only 8 traces. Therefore, merging compatible traces reduce the complexity in equivalence checking. (ii) For Parker and Waka examples, the number of traces becomes the same after compatible trace merging resulting in improving verification complexity. None of the existing techniques [88] can handle such scenarios of equilavence. (iii) In Dfadd, the number of traces is not equal even after merging.

However, equivalence is proved by our method. (iv) Except dfadd, equivalence is established efficiently, i.e., in the while loop of Algorithm 1 due to merging compatible traces.

Comparisons: Although a few end-to-end HLS verification methods like VTV [88] and V2C [100] are available, the C to RTL verification for HLS is still an open problem. We found that V2C generates incorrect C code from the Verilog generated by the Vivado HLS tool. So, it can’t be used for our purpose. The VTV is the closest to our work which is also applied to verify the results of Vivado HLS. However, VTV fails to show the equivalence when the number of traces is not the same in C and corresponding RTL since it does not support optimizations that alter the control structure. Therefore, VTV will fail in three (i.e., waka, Parker, and dfadd) of our test cases. This shows the novelty of our proposed method over the state-of-the-art techniques.

Usefulness: The benchmarks consist of complex if-else (findMin8), loops and arrays (ma- trixAdd, sumArray), complex non-linear arithmetic operations (arf and motion), and func- tion calls (Dfadd). Also, our method works for non-equivalent cases. Therefore, DEEQ is useful in the verification of HLS results for a commercial HLS tool. The scalability of the method is demonstrated with a relatively larger benchmark having around 3K lines of RTL-C code (Dfadd). Since our method merges many compatible traces before check- ing equivalence and uses a data-driven approach to find the correspondence of traces, the method is expected to scale well for larger benchmarks as well. We have also tested on a recent bug reported in the Vivado HLS tool in [72] that produces the wrong result dur- ing RTL simulation in Xilinx Vivado HLS v2017.2. In this test case, a large integer value is shifted repeatedly by the values stored in an array. For this example, our equivalence checker reported the non-equivalence. We have tested with FastSim as well and it reports the non-equivalence with output for source code as 73741823 and for the RTL-C as 6632959.

Therefore, our tool can detect the reported bug in Vivado HLS. Thus, experimental results

DEEQ: Data-driven End-to-End EQuivalence Checking of High-level Synthesis

show the applicability and scalability of our method for a wide variety of applications.