
Exploiting Hardware-Level Parallelism in the Manticore Hardware-Accelerated RTL Simulator

Before a chip design is turned from a hardware description language (HDL) like VHDL or Verilog into physical hardware, testing and validating the design is an essential step. Yet simulating an HDL design is rather slow, because the simulator uses either a single CPU thread or only limited multi-threading, a consequence of the fine-grained concurrency involved. The root cause is the strict timing requirements of simulating hardware and the various clock domains that ultimately determine whether a design passes or fails. In a recent attempt to speed up register-transfer level (RTL) simulations like these, Mahyar Emami and colleagues propose a custom processor architecture, called Manticore, that can be used to run an HDL design after nothing more than compiling the HDL source and some processing.

In the preprint paper they detail their implementation, covering the static bulk-synchronous parallel (BSP) execution model that underlies the architecture and associated tooling. Rather than having the simulator (hardware or software) determine the synchronization and communication needs of different elements of the design-under-test, the compiler instead seeks to determine these moments ahead of time. This simplifies the requirements of the Manticore execution units, which are optimized to execute just this simulation task.
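To make the static BSP idea concrete, here is a minimal Python sketch. It is not Manticore's instruction set or compiler output; the `Core` class, the `superstep` function and the 2-bit counter "design" are all hypothetical names chosen for illustration. The point it shows is that each core's instruction list and message list are fixed before the simulation starts, so at run time a core only computes on local state, sends its pre-scheduled messages, and waits at the barrier.

```python
from dataclasses import dataclass


@dataclass
class Core:
    regs: dict      # register name -> current value (core-local state)
    program: list   # fixed instruction list: (dest, fn, source register names)
    sends: list     # fixed message list: (local reg, target core, target reg)


def superstep(cores):
    """Simulate one clock cycle of the design as one BSP superstep."""
    # Compute phase: every core evaluates its program on local state only.
    updates = [
        {dest: fn(*(core.regs[s] for s in srcs))
         for dest, fn, srcs in core.program}
        for core in cores
    ]
    for core, update in zip(cores, updates):
        core.regs.update(update)

    # Communication phase: the message list was decided ahead of time,
    # so no core has to discover at run time who needs its values.
    messages = [
        (target, target_reg, core.regs[local_reg])
        for core in cores
        for local_reg, target, target_reg in core.sends
    ]
    for target, target_reg, value in messages:
        cores[target].regs[target_reg] = value
    # Returning from this function plays the role of the barrier.


def make_counter():
    """A 2-bit counter split by hand across two cores (assumed partitioning)."""
    core0 = Core(
        regs={"b0": 0},
        program=[("b0", lambda b0: b0 ^ 1, ["b0"])],
        sends=[("b0", 1, "b0_copy")],   # core 1 needs a copy of bit 0
    )
    core1 = Core(
        regs={"b1": 0, "b0_copy": 0},
        program=[("b1", lambda b1, b0: b1 ^ b0, ["b1", "b0_copy"])],
        sends=[],
    )
    return [core0, core1]


def run(cycles):
    """Return the counter value observed after each simulated cycle."""
    cores = make_counter()
    values = []
    for _ in range(cycles):
        superstep(cores)
        values.append(cores[1].regs["b1"] * 2 + cores[0].regs["b0"])
    return values


if __name__ == "__main__":
    print(run(5))
```

In this toy the partitioning and the message list are written by hand. In the approach the paper describes, working out that schedule is the compiler's job, which is what lets the execution units stay simple.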

Although an ASIC version of Manticore would be significantly faster than the FPGA version the researchers used in this implementation, the 475 MHz, 225-core implementation on a Xilinx UltraScale+ FPGA (Alveo U200 card) compared favorably against the Verilator simulator, which was run on three x86 systems ranging from an Intel Core i7-9700K to an AMD EPYC 7V73X. Best of all was the highly impressive scaling the Manticore FPGA implementation demonstrated.

At this point Manticore is primarily a proof-of-concept, which like every PoC comes with a number of trade-offs. Only a single clock domain is supported, SystemVerilog support is limited, the Scala-based tooling is very unoptimized, and waveform debugging is a TODO item. What it does demonstrate, however, is that RTL simulators can be made significantly faster, assuming BSP lives up to its purported benefits when faced with more complicated designs.

