Processor Design Project

This is the homepage for the Processor Design Project. You can find most course information here.

This course is part of the CESE masters programme. This course is a continuation of the curriculum of Computer Arithmetic.

Notice of Changes

This course will be SIGNIFICANTLY changed this year, and the information on this page is no longer correct. It is currently being updated.

Project Overview

  • You do this project in groups of 3
  • You can choose your own groups
  • Once you have chosen a group,
    • Join a Brightspace group
    • Send an email to j.b.doenszelmann@tudelft.nl with the three names and the email addresses linked to your Dropbox accounts. This will give you access to a git repository and the Dropbox folder to which you can upload your benchmarks

In this course, you will work in groups of 3 to improve the performance of a MIPS-based CPU. To evaluate your new design, you are given a set of benchmarks from the GMPbench and MiBench suites, and remote access to Zynq-7000 FPGA boards. To do so, you will receive:

  1. A git repository which includes the MIPS processor VHDL code and the benchmarks, the files required for simulation and FPGA implementation, as well as the MIPS cross-compiler
  2. This website, detailing the project assignment and the general simulation flow and toolset.

Schedule

Date | Activity
April 22nd in EWI/EEMCS Lecture Hall D@ta | Kickoff Meeting
Every Thursday (except public holidays) | Q&A Lab in EWI Hall M from 13:45 to 17:45
May 8th and 9th | Midterm Milestone Meeting
June 23rd | Report Submission
TBD | Project Presentation

Getting Started

  • Find a Group and let us know about it!
  • Carefully read Project Workflow. It contains information on how to do the project.
  • If you have questions on how to set up Vivado, or need other support, come to the weekly labs!
  • Platform description describes the MIPS processor you will be modifying in this project.
  • In Frequently Asked Questions we are collecting some frequent questions. Please contribute in the labs!

Here you can find the baseline scores for all benchmarks.

Grading Procedure

The project's functionality will be verified, and all submissions will be checked for plagiarism (including between groups).

If the project is not functional you DO NOT pass the course. Plagiarism can also make you fail.

The final score for the project is determined based on the following criteria:

  • Design Performance (DP) - The benchmark score is quite important here, the higher the better, but it is not the only relevant aspect. We also take the other metrics into consideration, with emphasis on the compound ones, which give insight into the effectiveness of your proposal, e.g., area overhead vs. achieved improvement.
  • Technical Merit (TM) - Aspects such as innovation level and implementation quality are considered.
  • Report (R) - Report organization, content, and language are the important aspects here.
  • Presentation (P) - Presentation organization, slides, and the live performance are the main metrics.

The CESE4040 final grade is computed as:

Staff

The course will be led by Sorin Cotofana, with some assistance of Jonathan Dönszelmann.

Additionally, some TAs will help out.

Contact

Software Setup

See Vivado Installation

Milestone Meeting

The milestone meeting is in person and per group, and it is meant to be a midterm status check. It is NOT optional!

For the milestone meeting you must:

  1. Get familiarized with the project set-up and the FPGA server.
  2. Carry out the performance evaluation of the baseline processor design on the FPGA board with the provided benchmark suite.
  3. Make use of Vivado synthesis and implementation options to optimize the baseline performance without making any changes to the SoC design (except increasing the clock frequency if the slack allows for it) and redo the FPGA evaluation.
  4. Perform a Cache Design Space Exploration (DSE). The baseline design includes a 2KB direct mapped cache, which only maps the first 2MB of the main memory. However, the cache design is parametric, so during the DSE process you may vary the cache size, the amount of main memory mapped in the cache, the cache line width, and the cache associativity. Moreover, you can also make other changes to the cache, e.g., change the replacement policy in the case of associative caches. During the DSE you need to evaluate the clock frequency, execution cycles, area, and energy consumption of all the considered cache configurations and determine the most effective one, according to your own design goal.
  5. Identify at least two more possible improvement avenues.
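As a rough aid for planning the cache DSE in step 4, the sketch below enumerates candidate configurations and derives the address-field widths for each one. Only the 2KB size and the 2MB mapped range come from the description above. The 4-byte line width, the parameter ranges, and the name `cache_fields` are illustrative assumptions and are not taken from the baseline RTL.

```python
from itertools import product
from math import log2

KB, MB = 1024, 1024 * 1024


def cache_fields(cache_bytes, mapped_bytes, line_bytes, ways):
    """Return (offset_bits, index_bits, tag_bits) for one cache configuration."""
    sets = cache_bytes // (line_bytes * ways)
    offset_bits = int(log2(line_bytes))
    index_bits = int(log2(sets))
    # Only `mapped_bytes` of main memory are cacheable, so the tag only has
    # to distinguish addresses within that range.
    tag_bits = int(log2(mapped_bytes)) - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits


# Baseline from the text: 2KB, direct mapped (1 way), first 2MB of memory.
# The 4-byte line width is an assumption made for this example.
baseline = cache_fields(2 * KB, 2 * MB, line_bytes=4, ways=1)

# Hypothetical design space; replace the ranges with the ones you explore.
sizes = [2 * KB, 4 * KB, 8 * KB]
line_widths = [4, 16]
associativities = [1, 2, 4]
design_space = [
    (size, line, ways, cache_fields(size, 2 * MB, line, ways))
    for size, line, ways in product(sizes, line_widths, associativities)
]

for size, line, ways, (off, idx, tag) in design_space:
    print(f"{size // KB}KB, {line}B lines, {ways}-way: "
          f"offset={off} index={idx} tag={tag} bits")
```

Each configuration in such a list still has to be synthesized and run on the board to obtain its clock frequency, cycle count, area, and energy; the script only helps to keep track of what has been covered.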

For the milestone meeting you are required to present a short progress report that:

  1. Provides an overview of the Vivado optimization settings and comments on their impact on platform performance.
  2. Describes and comments on the impact of memory hierarchy changes on the platform performance.
  3. Presents the improvement approach that you decided to follow (justify your choices), the expected performance improvement, and the work status (achievements to date).

Project Submission

The project submission must be done via e-mail.

In your final submission, you should include:

  • The report, preferably a PDF (naming convention: CESE4040_2023_report_g#, with # being the group number), in the root of your git repository. See Reporting
  • Your git repository (ONLY the main branch counts)

The report should include the following:

  1. The general idea behind your overall optimization proposal. For the parts that you decided to change or to add, you must answer (at least) these questions: What, how, and why?
  2. The design of the improved and/or added part(s). There is no need to go down to gate level! The basic algorithm you decided to implement and the RTL designs are in general enough. You may go to full adder/gate level for some parts if this is essential for your proposal.
  3. Performance results for the baseline design (the original MIPS system) and your improved core. These should include detailed implementation results with timing information, critical path information, resource utilization information from the Vivado report, and power consumption figures.
  4. A comparison between your design and the baseline design in terms of basic metrics, i.e., Area (A), critical path Delay (deduced from clock frequency) (D), Benchmarks Score in terms of execution clock cycles (BS), maximum Power (P), total Energy consumption (E), as well as compound metrics, i.e., products of these basic metrics.
  5. Comments on the obtained results, e.g., an answer to “Which improvement contributes most to each figure of merit?”.
  6. Conclusions.
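Because the compound metrics in item 4 are products of the basic ones, a small script can keep the comparison consistent across designs. The sketch below is one possible way to tabulate them. The specific products chosen (area-delay and energy-delay) are assumptions, not metrics prescribed by the course, and every number in it is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    area: float      # A, e.g. in CLB equivalents
    delay_ns: float  # D, critical path delay deduced from the clock frequency
    cycles_m: float  # BS, benchmark score in million clock cycles
    power_mw: float  # P, maximum power
    energy_j: float  # E, total energy

    def exec_time_s(self):
        # (cycles_m * 1e6 cycles) * (delay_ns * 1e-9 s) = cycles_m * delay_ns * 1e-3
        return self.cycles_m * self.delay_ns * 1e-3

    def area_delay(self):
        # Assumed compound metric: area times critical path delay.
        return self.area * self.delay_ns

    def energy_delay(self):
        # Assumed compound metric: energy times execution time.
        return self.energy_j * self.exec_time_s()


def relative_change(baseline, improved):
    """Ratio improved/baseline per metric; values below 1.0 are improvements."""
    return {
        "A": improved.area / baseline.area,
        "D": improved.delay_ns / baseline.delay_ns,
        "BS": improved.cycles_m / baseline.cycles_m,
        "P": improved.power_mw / baseline.power_mw,
        "E": improved.energy_j / baseline.energy_j,
        "A*D": improved.area_delay() / baseline.area_delay(),
        "E*T": improved.energy_delay() / baseline.energy_delay(),
    }


# Placeholder numbers only; substitute your own measurements.
baseline = Metrics(area=500.0, delay_ns=20.0, cycles_m=100.0, power_mw=80.0, energy_j=2.0)
improved = Metrics(area=600.0, delay_ns=16.0, cycles_m=80.0, power_mw=85.0, energy_j=1.5)
ratios = relative_change(baseline, improved)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

Reporting ratios rather than raw values makes it easier to see where an area overhead is paid back by a larger gain elsewhere.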

The report may also include suggestions for this project, although this is not mandatory. Your feedback is very much appreciated. If you want to give it anonymously, you're also invited to fill in the Evasys questionnaire after the course is over.

Methodology

The first question we need to address is “What do we need in order to improve the system?” Starting from one of your targets, e.g., to improve the computational performance of an arithmetic core, you need at least an Electronic Design Automation (EDA) tool to describe your design in VHDL. An EDA tool is more than a text editor for VHDL: although you will spend most of your time wrestling with VHDL, the tool also provides what is needed to translate your VHDL description into "real" FPGA hardware.

Then you need to evaluate your design, to check its correctness and efficiency, so you need a way to simulate your design and to map it into real hardware. Moreover you need benchmarks to evaluate its correctness and performance.

Since an arithmetic core cannot run on its own, you need a supporting platform to embed it in, and on this platform you can execute the benchmark programs. This platform should be easily implementable in the same way in which your designs are going to be implemented. Concerning the benchmarks, you can develop them by yourself to verify your design, or you can use standard generally acceptable sets to evaluate your design and then to compare it with equivalent counterparts. The benchmark programs are usually written in a programming language with little runtime overhead, like C or Rust for example.

Generally speaking, the host system on which the implementation tools and the benchmark compilation run has a different instruction set than the general-purpose processor for which you develop your project. Thus you need a dedicated compiler (a "cross compiler", as the target platform differs from the host platform) to compile the benchmark programs for the supporting platform. Once you have the EDA tools and the supporting processor, have implemented the required hardware, and have generated the benchmark binaries via compilation, you can start evaluating your design.

In the best scenario you can simulate your design and synthesize it without any errors or bugs. In that case you can carry out the evaluation, and if the obtained performance is satisfactory you can start writing your report. However, this is quite unlikely to happen at the first attempt. Most of the time, you will get fatal errors at compilation time, soft errors at execution time, or strange errors whose origin is unclear. In any of these cases, you need debugging tools for hardware, software, or both.

FPGA work

Work with an actual FPGA is done remotely. To do so, you should make sure we have the email address that is linked to your Dropbox account. More information can be found here.

Project Workflow

To design, debug, and evaluate your improved processor you have to synthesize it on an FPGA board. To do so, follow this workflow:

project workflow

In the project you will follow these steps:

  1. The baseline design (the reference design that you will receive) should be analyzed and its weak point(s) determined by means of, e.g., profiling.
  2. Avenues for improvement should be identified.
  3. For each processor component you want to add or change, you should make a pen-and-paper plan capturing its input and output ports and RTL design.
  4. Write the corresponding HDL code for that component and test whether it performs its function as intended or not (behavioral RTL simulation).
  5. Only then can you test the design on the FPGA. For that, the following steps need to be taken:
    1. Logic synthesis, converting the RTL constructs from the HDL code into a design implementation in terms of generic resources (LUTs, MUXes, memory, flip-flops, adders, etc.).
    2. Map the design's generic resources onto the specific resources available on the Zynq FPGA (e.g., slices, IOBs).
    3. "Place and Route", wiring components together such that the imposed design criteria (such as pin constraints for component ports and timing constraints) are met.
  6. Load the resulting "bit file" onto the FPGA, to test.

Behavioral testing

Generally speaking, it is advisable to write an HDL testbench and test each component individually before integrating it into the system, a bit like unit tests in software.

The alternative would be to opt for testing the entire system, maybe through the C programs we provide. Though possible, it will be harder to debug your design that way.

To evaluate your processor, a set of benchmarks will be executed. This also requires a software application that initializes the board, sends the benchmark to the DDR memory, and finally collects the benchmark results from the board and displays them.

This application can be executed on your computer, to test in the provided emulator. However, once you have uploaded the bitstream to the FPGA board, the same code can be executed there, since the board also incorporates an ARM processor. The system timing performance, area, and energy consumption can then be evaluated and appropriate further actions undertaken.

While doing this project, we strongly recommend that you do the following:

  • Make a clear plan for your design before you start writing VHDL code.
  • Run a module level simulation every time you modify something in your VHDL code, before you start running a system level simulation.
  • Run a system simulation every time you change your design before you synthesize it!
  • Put your source code in git, so that you can restore old versions.

To evaluate your enhanced MIPS processor, a Pynq-Z1 board (reference manual) will be used.

The Pynq board

On this board, there's a Zynq-7000 SoC chip (reference manual), which is divided into two distinct subsystems:

The Pynq board overview

  • The Programmable Logic (PL), which includes an array of programmable logic blocks and a set of dedicated resources (e.g., block RAM memories for dense memory requirements, DSPs for high speed arithmetic, analog-to-digital converters)
  • The Processing System (PS), whose key components include:
    • hard silicon ARM Cortex-A9 processors
    • clock generators
    • DDR and memory controllers
    • I/O peripherals (e.g., GPIO, UART)
    • PS-PL connectivity interfaces.

Vivado installation

The software we use for simulation and to synthesize a design for the FPGA is Xilinx Vivado. Installers are available for both Linux and Windows. On Linux, we recommend against trying to install Vivado through your package manager, since we require a specific version of the software. You can download the installer at the following address: https://www.xilinx.com/support/download. Make sure you get version 2023.2 ("Vivado Design Suite - HLx 2023.2"), though 2022.2 likely also works. Before downloading, you may need to create an account. During installation, select Vitis as the edition to be installed, with the following components:

Installation Settings

The installation requires between 50GB and 100GB, and doesn’t need a license after installation to be used.

Cable drivers

Vivado might complain about missing cable drivers on Linux. To install them:

cd <vivado_install_dir>/data/xicom/cable_drivers/lin64/install_script/install_drivers/
sudo ./install_drivers

Read more at https://docs.amd.com/r/en-US/ug973-vivado-release-notes-install-license/Installing-Cable-Drivers

Functional Verification

All the VHDL code (except the benchmark files) is located in the rtl folder. The sim folder contains all files required for simulation, and the fpga folder contains all files needed for FPGA implementation.

To verify the correctness of your HDL design, you should run both module level and system level simulation. To perform module level verification, you have to write a testbench yourself. That testbench can then test a single module of your design.

System level verification

For system level verification, we have already provided you with a top-level testbench. You can find it in sim/testbench_vivado_1.vhd. To run it, follow these steps:

  1. Open the Vivado project in Vivado, not Vitis. You can find it in sim/zynq_sim/zynq_sim.xpr.
  2. Find the Flow Navigator pane on the left of the main window.
  3. Select IP INTEGRATOR → Open Block Design. This will open the block diagram of your HDL design, showing you the address map of its components.
    • If you double click on a component here, you can change its parameters. However, be careful if you do that, because if you change a parameter that corresponds to a value in VHDL, you must also change that value in VHDL, otherwise it will not actually be updated.
  4. Select a program to test with. By default this is benchmarks/opcodes/opcodes.asm. However, you can change which one it uses in rtl/platform/bram.vhd. For example, you can set it to run pi, which is the name of another benchmark.
    • The benchmarks are programs written in C, which can be compiled for our simulated MIPS processor, generating a vhd file with the MIPS instructions in RAM. Check benchmarks/README.md for further instructions and sim/main_pack_opcodes.vhd for an example vhd.
    • The simulator outputs over UART, which by default is saved in sim/zynq_sim/zynq_sim.sim/sim_1/behav/xsim/uart_output.txt. However, if you'd like to change this you can do so by changing the Log File parameter of the uart component.
  5. Start the simulation
    • Go back to the Flow Navigator pane
    • Select SIMULATION → Run Simulation → Run Behavioral Simulation.

At this point, you can select which signals you'd like to inspect in the waveform output of your design. In the image below, you can see how to do this.

Instructions on how to test your program

Now, you can either run the simulation until it is manually interrupted (the Run All button), or set the simulation time in the textbox and press the smaller arrow beside it. To get an idea of the timescales we're working with, the main_pack_opcodes benchmark takes around 35ms.

Testing Toolbar

NOTE: Every time you make a change in the RTL code, the simulation should be relaunched.

For more information, this manual, and this tutorial might help you.

Once your design passes system level verification, you can continue to the next step: Synthesizing your design. However, if the tests don't pass, you will need to revisit your HDL code and redo the steps in this chapter.

Custom Tests

The assembly code corresponding to a test of all possible ISA opcodes is provided in benchmarks/opcodes/opcodes.asm. Although it can help you debug your design, you might want to make some finer grained tests than that. For that reason, we also include a cross-compiler and assembler for the MIPS architecture you're working on.

To make your own benchmark written in C, you can change the C code in benchmarks/custom. To build it, execute

cd benchmarks/custom
make clean image

A corresponding .vhd file will be generated in the same folder.

Modifying the ISA

This section is incomplete. Older manuals did have it, but the information got outdated. If you want to do this, send an email to S.D.Cotofana@tudelft.nl, also to get the compiler source code files.

If you choose to modify the instruction set of the processor you're building, you will also need to change the compiler for it. Otherwise, the new instructions you add will not be used by the compiler, or if you remove instructions, the compiler will still happily generate them.

To do this, we provide the source of the MIPS cross compiler: mips-pdp-elf-...., as well as tools to build this cross compiler itself (crosstool-NG).

The required compiler modifications should be made in the form of patches which should be placed in the folder toolchain-ctng/ctng-bin/lib/crosstool-ng-1.22.0/patches/my_patches.

The source packages used to build the cross-compiler are available in the folder toolchain-ctng/ctng_toolchain_src. It is advisable not to overwrite the original compiler, but to run

ct-ng menuconfig

Synthesizing your Design

To synthesize and implement your design, open the Vivado project file (zynq_fpga.xpr) in the folder fpga/zynq_fpga.

The Flow Navigator pane (left of the window) contains all the available FPGA implementation flow steps.

Using this pane works in a number of steps:

  • functional verification
  • updating the block diagram to reflect the parameters settings from the simulation block diagram and validating this
  • generating the bitstream

This is accomplished from the Flow Navigator pane, either step by step, which allows you to analyze the results of each step:

  • SYNTHESIS → Run Synthesis;
  • IMPLEMENTATION → Run Implementation;
  • PROGRAM AND DEBUG → Generate Bitstream

Or directly by clicking PROGRAM AND DEBUG → Generate Bitstream, which will automatically execute all the previously mentioned steps.

Upon successful completion, the bitstream file (.bit) will be created.

Normally, block diagram output products necessary for synthesis are automatically re-generated every time. However, if for some reason the older files are used, the output products can be forcefully re-generated prior to running synthesis by selecting in the Flow Navigator pane IP INTEGRATOR → Generate Block Design and then selecting Global as the Synthesis Options.

Vivado FPGA Implementation Strategies

Depending on the algorithms and overall strategies selected within the Vivado tool synthesis and implementation menu, slightly different FPGA implementations (with different area utilization and timing performance) can be obtained.

A strategy is a set of Vivado tool settings, which specify the design flows and the optimization levels. Vivado provides a set of predefined strategies for synthesis and implementation. Alternatively, you can also create your own strategies. Vivado's Synthesis & Implementation settings, displayed below, can be accessed from the Flow Navigator pane via Settings.

Vivado Synthesis Settings Vivado Implementation Settings

While synthesis consists of just a single subprocess, the implementation step has multiple such subprocesses, each consisting of a series of steps which can be optimized with specified effort level (-directive setting). Identifying effective strategies can lead to best results based on specified design goals. For example, you can use this when you have a performance oriented goal. By changing the Synthesis and Implementation strategy, minor timing closure violations (e.g., small negative slack) can be resolved. This can even lead to a higher maximum clock frequency for the design. For further details about the Vivado tool Synthesis and Implementation strategies, please refer to the Vivado Design Suite User Guide - Synthesis and Vivado Design Suite User Guide - Implementation.

Implementation subprocesses

Processor Evaluation

Once you obtain a bug-free design you may proceed to its evaluation. Some information can be found in the Vivado reports, such as the maximum operating frequency and the area of your design. Other results can only be found by executing benchmarks on the actual FPGA.

Area Evaluation

Below, an example of a design floorplan is given. This is the design floorplan of the baseline CPU we gave you, with the Programmable Logic (PL) and the Processing System (PS) sections outlined.

Baseline Design Floorplan

The main resources for combinatorial and sequential circuits are Configurable Logic Blocks (CLBs). Each CLB consists of 2 interconnected slices, as shown in the example above, which contain the following resources:

  • 8 look-up tables for random logic implementation or distributed memory,
  • 6 multiplexers,
  • 2 fast carry chains,
  • 16 Flip-Flops out of which 8 can be configured as latches.

For memory implementation, besides the distributed RAM (from CLBs), a number of 36Kb RAM Blocks (RAMBs) are available (see Vivado Design Suite 7 Series FPGA and Zynq-7000 SoC Libraries Guide for BRAMs instantiation primitives). To estimate the area of your design, a compound value should be derived from the implemented design's Utilization Report. You can generate this for the cpu_0 component through Vivado's IMPLEMENTATION → Open Implemented Design → Report Utilization. Below we've shown a sample report of the baseline cpu:

Utilization Report

You can then use this information to estimate the total area using the following relations:

  • The area of a slice is half the area of a CLB
  • The area of a RAMB18 (18 kbit configurable RAM block) is 1.2 times the size of a CLB
  • The area of a RAMB36 (36 kbit configurable RAM block) is 2.4 times the size of a CLB

In the example above, cpu_0 uses 991 slices and no block RAM tiles, so the total area is 991 × 0.5 = 495.5 CLBs.
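The relations above can be captured in a small helper, sketched here so the same calculation can be repeated for every design variant. Expressing the result in CLB equivalents and the function name `total_area_clb` are assumptions made for this example; the slice count of 991 is the one quoted for the baseline cpu_0.

```python
# Relative areas from the relations above, expressed in CLB equivalents.
SLICE_AREA = 0.5    # a slice is half a CLB
RAMB18_AREA = 1.2   # an 18 kbit RAM block is 1.2 CLBs
RAMB36_AREA = 2.4   # a 36 kbit RAM block is 2.4 CLBs


def total_area_clb(slices, ramb18=0, ramb36=0):
    """Compound area estimate from the counts in a Vivado utilization report."""
    return slices * SLICE_AREA + ramb18 * RAMB18_AREA + ramb36 * RAMB36_AREA


# Baseline cpu_0: 991 slices, no block RAM tiles.
baseline_area = total_area_clb(slices=991)
print(f"baseline area: {baseline_area} CLB equivalents")
```

A design that adds block RAM, for instance a larger cache, would pass the RAMB18 and RAMB36 counts from its own utilization report as the second and third arguments.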

Timing Analysis

To check that the implemented design works at the requested frequency, a timing report should be generated in Vivado through IMPLEMENTATION → Open Implemented Design → Report Timing Summary. A timing tab will open up detailing all the timing constraints (usually one for each clock signal in the design) with the top slowest propagation paths and any potential timing violations.

Below, we've shown the timing summary section of the baseline cpu:

Timing of the Baseline CPU

This shows us whether all the timing constraints are met (slack values should be positive to meet timing) and thus that the design will be able to run in a reliable manner at the requested frequency. If the design does not meet the desired frequency, it is recommended to focus on the Worst Negative Slack (WNS) as the main way to improve the total negative slack.

If the slack is positive, the design is guaranteed to work reliably at the requested frequency. However, if the slack is negative, the design might work within reasonable negative slack bounds, or it might not work properly at all. Based on the WNS value, we can also estimate the maximum attainable operating frequency as follows:
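A common form of this estimate is F_max = 1 / (T − WNS), where T is the requested clock period and WNS is taken with its sign (negative when timing is violated). The sketch below assumes that form; the function name and the example numbers are hypothetical and are not taken from the baseline timing report.

```python
def estimate_fmax_mhz(requested_mhz, wns_ns):
    """Estimate the maximum clock frequency from the requested one and the WNS.

    Assumes F_max = 1 / (T - WNS). A positive WNS shortens the achievable
    period, a negative WNS lengthens it.
    """
    period_ns = 1000.0 / requested_mhz
    achievable_period_ns = period_ns - wns_ns
    if achievable_period_ns <= 0:
        raise ValueError("WNS must be smaller than the requested clock period")
    return 1000.0 / achievable_period_ns


# Hypothetical examples for a 50 MHz (20 ns) constraint.
fmax_with_margin = estimate_fmax_mhz(50.0, wns_ns=2.0)      # 18 ns period
fmax_with_violation = estimate_fmax_mhz(50.0, wns_ns=-1.0)  # 21 ns period
print(f"{fmax_with_margin:.2f} MHz, {fmax_with_violation:.2f} MHz")
```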

This FMAX estimation could be pessimistic, as Vivado does not try to further improve timing after the timing constraints are met.

Clicking on the value of WNS will display information for the top 10 paths with the worst delay in the design. Double clicking on the critical path (Path 1) will display detailed timing related information, among which the path start point and end point, the number of levels of logic, the logical resources included in the path, and the slack with respect to the requested frequency, as shown below.

Timing Summary Report: Intra-Clock Paths Details

Right-clicking on Path 1, and selecting Schematic will display the critical path schematic, as illustrated.

Critical Path Critical Path

Further information on how to read a timing report can be found in the Vivado documentation.

Executing Benchmarks

To evaluate your improvements of the cpu, you are given a set of benchmarks to be run on the FPGA board. After benchmark execution on the board the message "CORRECT!" is displayed if the benchmark results are as expected. Otherwise, an "ERROR" message is displayed.

In addition, to evaluate the performance of your solution, the number of CPU cycles (in million cycles) consumed by the benchmark execution, the peak power consumption (in mW), and the energy (in J) are measured and displayed.

The scores of the baseline cpu for these benchmarks are shown in the table below:

Baseline benchmark scores

Name | Description | Cycles (million) | Time (s) | Energy (J) | Peak power (mW)
opcodes | Tests all MIPS I instructions (only for testing) | - | - | - | -
cjpeg | JPEG compression | 100.018590 | 1.6952 | 1.722011 | 79.197090
divide | Large number (192Kb) integer division using GMP library | 707.854012 | 11.9975 | 2.467115 | 78.970367
multiply | Large number (64Kb) integer multiplication using GMP library | 390.739735 | 6.6227 | 2.187528 | 81.413025
pi | Computes 1000 digits of PI using basic arithmetic operations | 884.506178 | 14.9916 | 2.784943 | 81.311775
fir | Length-63 bandpass FIR filter applied to 50000 input samples | 213.740789 | 3.6227 | 1.696708 | 72.882149
rsa | RSA message signing using GMP library | 1420.508182 | 24.0764 | 3.230663 | 79.212349
ssd | Pattern matching using Sum-of-Squared-Differences | 2305.992254 | 39.0846 | 4.397920 | 82.596886
ssearch | String search using look-up tables | 1419.875462 | 24.0657 | 3.343137 | 81.810158
susan | Gaussian image smoothing | 1601.722838 | 27.1478 | 3.540810 | 82.110268
benchall | All Benchmarks in one run | 8654.753342 | 146.6907 | 11.774730 | 81.538254
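The time figures in the table follow from the cycle counts and the clock frequency. Dividing cycles by time for the rows above gives roughly 59 MHz in every case; that frequency is inferred from the table and is not stated in the text. The sketch below uses it to convert a cycle count into execution time and to compute a speedup. The improved cycle count in the example is a made-up placeholder.

```python
INFERRED_CLOCK_HZ = 59e6  # inferred from cycles / time in the table (assumption)


def exec_time_s(cycles_million, clock_hz=INFERRED_CLOCK_HZ):
    """Execution time in seconds for a benchmark score given in million cycles."""
    return cycles_million * 1e6 / clock_hz


def speedup(base_cycles_million, new_cycles_million,
            base_hz=INFERRED_CLOCK_HZ, new_hz=INFERRED_CLOCK_HZ):
    """Wall-clock speedup; accounts for a changed clock frequency as well."""
    return exec_time_s(base_cycles_million, base_hz) / exec_time_s(new_cycles_million, new_hz)


cjpeg_time = exec_time_s(100.018590)        # the table lists 1.6952 s
benchall_time = exec_time_s(8654.753342)    # the table lists 146.6907 s
example_speedup = speedup(100.0, 80.0)      # placeholder cycle counts
print(f"{cjpeg_time:.4f} s, {benchall_time:.4f} s, speedup {example_speedup:.2f}x")
```

Because the benchmark score is reported in cycles, a design that lowers the cycle count but also lowers the clock frequency can still be slower in wall-clock time, which is why `speedup` takes both frequencies.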

The C and assembly source code for each benchmark can be found in the benchmarks folder. Note that if your cpu performance improvement strategy relies (also) on MIPS ISA modification/augmentation, then you need to recompile the benchmarks before running them.

In order to execute the benchmarks on your modified cpu, 2 things are needed:

  1. the bitstream file (.bit) (generated in Vivado) which is to be uploaded to the board and used to configure the FPGA
  2. the benchmarks .bin file.

These two files are stored on the Zynq's SD card. To manage the benchmark execution, the ARM cpu on the Zynq board is used. Specifically, the ARM cpu tasks are: initialize the board interfaces, transfer the benchmark from the SD card into the DDR memory, start the benchmark execution on the processor, receive the benchmark results from the processor, and redirect them to an external file. The ARM cpu tasks are performed by a C program created in Vitis.

After the bitstream generation has completed successfully in Vivado, you can use the workflow shown below to create a wrapper bitstream called design_2_wrapper.bit.

The steps below line up with the ones highlighted in the picture below.

  1. Create a compound file design_2_wrapper.xsa, which should be in fpga/zynq_fpga. This file contains the bitstream and other files necessary for the initialization of components like clocks, DDR and GPIO.
  2. You can skip this step when you don't have physical access to the Zynq board, as is the case during this course.
  3. You can skip this step as well.
  4. Launch Vitis and make it aware of the changed hardware specification design_2_wrapper.xsa.
  5. Next, change a line in the code that will run on the ARM cpu next to the FPGA (source code in appARMcpu_main.c): set the name of the benchmark you want to run. This C program will:
    1. Read the target benchmark that will be evaluated on your processor from the board's SD card and transfer it to the PS DDR memory.
    2. Initialize the PS components (clocks, DDR, GPIO), provide the PL clock and reset the PL logic.
    3. Receive the benchmark results from the UART and redirect them to an external file.
  6. Compile this program through Vitis to get an executable: appARMcpu.elf. You can also skip step 6b, since you don't have a board physically connected to your laptop.

You're now done with Vitis, and you can close it.

Vitis Workflow

Running Benchmarks Remotely

The communication of input/output files for remote FPGA access is performed via Dropbox shared folders.

Preliminary Dropbox folder setup:

For each group, a shared folder will be created on dropbox. In order to access the group folder, you are required to have Dropbox (www.dropbox.com) account(s) and provide us with the email addresses linked to your Dropbox accounts. Please send an email to j.b.doenszelmann@tudelft.nl with your group's names and email addresses which are linked to your dropbox accounts. A shared dropbox folder can then be accessed for upload/download either locally (if the Dropbox client is locally installed), or online via the Dropbox website.

The remote access flow for verifying the FPGA implemented design consists of the following steps:

  • Place the design_2_wrapper.xsa (located typically in the fpga/zynq_fpga folder) and appARMcpu.elf (located in the fpga/zynq_fpga/workspace/appARMcpu/Debug folder) files in the Dropbox folder. NOTE: Only files with these names are accepted.
  • The FPGA will be programmed automatically with the bitstream design_2_wrapper.bit, and the ARM cpu prepared for executing the appARMcpu.elf application. Afterwards, these files will be automatically deleted from the Dropbox folder. NOTE: There are multiple FPGA boards. Bitstreams are scheduled onto the boards using a round-robin policy across the groups' Dropbox folders, taking the timestamps of the individual bitstream files into account.
  • For each programmed bitstream, two output files will be generated in the Dropbox folder:
    • results.txt, with the benchmark execution results (e.g., correctness status of the benchmark results, performance)
    • log.txt, with ERRORS/INFO/WARNINGS concerning the status of the entire simulation (e.g., file consistency, FPGA programming, UART receive/transmit).

In both .txt files, "DONE" marks a successful end of communication with the remote FPGA board.

Warning:

If you want to preserve the .txt files, you have to save them in a separate folder, as the next evaluation will overwrite them.
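One way to avoid losing results is to copy both files to a separate folder as soon as a run has finished. The sketch below does this once both files contain the "DONE" marker. The folder layout, the function names, and the timestamp-based label are assumptions for this example; only the file names results.txt and log.txt and the "DONE" marker come from the description above.

```python
import shutil
from datetime import datetime
from pathlib import Path


def run_finished(text):
    """True if the file content contains the "DONE" end-of-communication marker."""
    return "DONE" in text


def archive_results(dropbox_dir, archive_root, label=None):
    """Copy results.txt and log.txt into archive_root/<label>.

    Returns the target folder, or None if a file is missing or the run
    has not yet reported "DONE" in both files.
    """
    dropbox_dir = Path(dropbox_dir)
    files = [dropbox_dir / "results.txt", dropbox_dir / "log.txt"]
    for f in files:
        if not f.is_file() or not run_finished(f.read_text(errors="replace")):
            return None
    # Default label: current local time, so successive runs do not collide.
    label = label or datetime.now().strftime("%Y%m%d-%H%M%S")
    target = Path(archive_root) / label
    target.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy2(f, target / f.name)
    return target
```

A call such as `archive_results("path/to/group_folder", "path/to/archive", label="cache_4KB_2way")` (both paths hypothetical) would then keep one folder per evaluated design.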

Custom Benchmarks

For testing a custom benchmark three files have to be placed in the Dropbox folder:

  • design_2_wrapper.xsa,
  • appARMcpu.elf (after being modified in Vitis to update the benchmark name to ’custom.bin’),
  • and a file called custom.bin (which is generated using the provided MIPS cross-compiler).

Energy Evaluation

The Vivado tool can provide a power consumption estimate of your design, obtained through IMPLEMENTATION → Open Implemented Design → Report Power. To derive the dynamic power consumption estimate, the Vivado tool, by default, does not require the user to specify any information related to the switching profile of the design nets (default switching rates are being assumed). The confidence level of this estimate however is relatively low, and in order to perform a more accurate power analysis, a profile with the actual implemented design signals switching activity should be provided.

Such a switching profile can be obtained by simulating the implemented design in QuestaSim with a benchmark and logging the value changes of the relevant design signals together with their timestamps. This process is very time consuming, and the estimates can still be inaccurate compared to the actual power consumed when running on the board. Instead, we monitor and measure the power rails of the PL fabric while the benchmark runs on the board. The energy consumption and peak power are derived from these measurements and reported in the results.txt file after a successful benchmark completion.
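To illustrate how energy and peak power follow from rail measurements, here is a minimal sketch. The sample format (time, voltage, current) and the trapezoidal integration are assumptions made for illustration; the actual measurement pipeline on the boards is not described on this page.

```python
def energy_and_peak(samples):
    """Derive energy and peak power from power-rail samples.

    samples: list of (time_s, voltage_V, current_A) tuples sorted by time
    (an assumed format). Returns (energy_J, peak_power_W).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to integrate")
    # Instantaneous power at each sample point: P = V * I.
    powers = [v * i for _, v, i in samples]
    energy = 0.0
    for k in range(1, len(samples)):
        dt = samples[k][0] - samples[k - 1][0]
        # Trapezoidal rule: average power over the interval times its length.
        energy += 0.5 * (powers[k] + powers[k - 1]) * dt
    return energy, max(powers)
```

A higher sampling rate gives a better energy estimate and makes it less likely that a short power peak is missed between two samples.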

Note:

QuestaSim is a proprietary tool that you probably do not have access to, so measuring on a real board is probably your only option.

Reporting Results

Your report should have a clear flow from the beginning to the end. Sections should nicely follow each other logically with connecting links from one to the other.


In the introduction you should provide a short summary of the entire work while covering the following items:

  • Motivation for your approach: what is the general idea behind your optimization proposal, how have you decided to make the changes to the core, and why those and no other changes;
  • Changes that you have performed to the cpu, together with their implications
  • Obtained main results
  • Your main conclusions

This section should not include details about your design. At the end of the introduction add the organization (outline) of the report.


Next, include a separate section in which you motivate your choices for improvement in a way which is much more descriptive and comprehensive than what you mentioned in the introduction.

To this end you need to detail:

  • The way you analyzed the baseline processor in order to identify its weak points
  • Your findings based on which you have decided to make the improvements
  • The improvement avenues you will pursue
  • What you expect from implementing your changes

The following section should provide a description of all the performed changes. Here you need to motivate, in detail, your design choices. For example: Let us presume that you chose to improve the multiplier. Before you go into details regarding the architecture and the implementation of the improved version, you should include a small survey of the different types of multipliers you could choose, and present the reason(s) for your choice. Do not forget to refer to the literature. While completing this section give attention to the following:

  • Include figures to present the architecture of the new blocks that you introduce in the design. It is advisable to start from the top level and go down to each important component, (i.e., component that you have implemented in a special manner), in such a way that it is clear where it is placed and why.

  • Describe the way you embed your designs into the processor. You can draw the action of handshaking signals in the form of timing chart and/or state machine. Bear in mind that handling the interface signals properly is essential in order to allow for a smooth integration of your new modules within the processor.

  • Discuss the design verification aspects you considered.

  • If relevant, attach Vivado simulation results.

  • Summarize the settings of the final processor configuration.


The next section is dedicated to reporting and commenting on the experimental results. In this section you should:

  • Provide experimental results and analyze them separately for each modification you did, as well as for combinations of improvements. In this way, you can compare the improvements in terms of cost and benefit against the baseline, against each other, and see how the results add up when you combine different improvements.
  • Report the detailed results (including timing information, critical path information, resource utilization information from Vivado reports, and power consumption figures) for all considered designs.
  • Take into account in your analysis basic relevant performance metrics, e.g., area (A), critical path delay (deduced from clock frequency) (D), Benchmark Scores (BSs), energy consumption (E), as well as compound metrics, e.g., A*D, A*BS, E, and E*BS products.
  • Comment on the obtained results and try to identify which improvement is contributing to which figure of merit and which proposed improvement is the most effective.
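The compound metrics listed above are plain products of the basic ones, so they are easy to tabulate per design. The sketch below is only an illustration: the class and field names are hypothetical, and which unit you use for area (e.g. LUT count) is your own choice.

```python
from dataclasses import dataclass


@dataclass
class DesignResult:
    name: str
    area: float      # A, e.g. number of LUTs (assumed unit)
    delay_ns: float  # D, critical path delay in ns
    score: float     # BS, benchmark score
    energy: float    # E, energy consumption


def delay_from_frequency(f_mhz):
    """Critical path delay in ns deduced from the clock frequency in MHz."""
    return 1000.0 / f_mhz


def compound_metrics(r):
    """Return the product metrics A*D, A*BS and E*BS for one design."""
    return {
        "A*D": r.area * r.delay_ns,
        "A*BS": r.area * r.score,
        "E*BS": r.energy * r.score,
    }
```

Computing the same dictionary for the baseline and for every modified design gives directly comparable rows for the results tables.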

Finally, summarize your work and add your conclusions and possible future work plans in the last report section. In this context you should also put things into perspective and include the following:

  • What was your initial plan;
  • What were your expectations;
  • What the results are and, where they deviate from your expectations, why.

If you have any feedback for us or about the course, you can also include it in your conclusion. This is not mandatory, but it is greatly appreciated. The feedback does not count towards your grade in any way. If you are worried that it carries your name because the report does, we recommend filling in our evasys questionnaire, which is 100% anonymous.

Some other issues you need to think about when writing your report:

  • The report should be consistent, in structure, language, and formatting style.
  • It is a common practice to make use of the present tense, with the exception of the conclusion section, where past tense should be employed.
  • Headings are written with capitals and numbered.
  • The text is usually justified.
  • References should be placed in a dedicated section at the end of the report.

See also: Submission Information

The CPU and surrounding components

The platform, whose block design is illustrated below, is a minimal System-on-a-Chip (SoC) design written in VHDL which consists of the following components:

  • a 32-bit MIPS processor core - cpu (component 1),
  • a boot memory - bram (component 6) and its controller (component 3),
  • a unified 2-KB cache - included in cpu,
  • a DDR memory and its controller - included in the Zynq PS (component 7), and
  • a Universal Asynchronous Receiver/Transmitter (UART) unit - uart (component 8).

Component 10 consists of sensors and an analog to digital converter that are used for the energy evaluation. Components 2, 4, 5, 9 are either interconnects or interface protocol converters. The Zynq PS - component 7 - generates the clock signal for all the platform components, and the reset signal for the components 9 to 11. Then component 12 generates the reset signal for the rest of the platform. The cpu - component 1 - is based on the open-source Plasma cpu and interfaces available on the OpenCores and GitHub websites. The original Plasma design was updated for our purpose and tailored for the Pynq-Z1 board. All components marked with RTL are VHDL modules with source code given, while all the other components are Intellectual Property (IP) cores provided by Xilinx.

Cpu Platform Diagram

Using the mlite CPU emulator

An emulator of the mlite CPU is provided to debug the benchmarks or derive runtime statistics such as instruction usage. To run the emulator for a specific benchmark, one change needs to be made in the common Makefile. In '/benchmarks/Makefile.commmon', change line 88 from '../../tools/mlite.exe (IMG_FILE) B'. Now, in the directory of a specific benchmark such as '/benchmarks/pi', run the following:

make clean image 
make test_sim

If you want to change the behaviour of the emulator, you can do so by altering and re-compiling 'emulator/mlite.c'.

The Communication Protocol

All platform components communicate among themselves using the Advanced eXtensible Interface (AXI) protocol, which is a standard ARM communication protocol, and the de facto standard adopted by Xilinx for the connection of IPs and functional blocks in SoC designs. There are 3 types of AXI interfaces employed in the SoC platform:

  • AXI3 (component 7 - S_AXI_HP0 port)
  • AXI4LITE (component 8 - axi port)
  • AXI4 for the rest of the platform components

The figure below outlines the AXI4 protocol, architecture and a timing diagram corresponding to a write and a read transaction. Subsequently, we detail only the write transaction handshaking, as the read transaction follows suit. Whenever an AXI Master wants to perform a write transaction, it will send first a set of initial information about the transaction (e.g., burst type, burst size, cacheable attributes of transaction) to the AXI Slave.

At the same time, the AXI Master will send to the AXI Slave the address where the data should be written (AWADDR) and will signal that the driven address is valid (AWVALID). When the AXI Slave is able to accept the address, it will signal back to the AXI Master (AWREADY). The address transfer from Master to Slave happens when both AWVALID and AWREADY are asserted (clock cycle T3 in the Figure). Similarly to the write address transfer, the write data (WDATA) will be sent from the AXI Master to the AXI Slave (clock cycles T5, T7, T9, and T10 in the Figure). The signal WLAST will mark the last data transfer in the burst. The Slave will then respond to the Master whether the write transaction was successful or not (BRESP), and assert BVALID when it drives a valid write response (clock cycle T11 in the Figure).
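The VALID/READY handshake described above can be sketched as a small cycle-level model. This is a simplified illustration, not the platform's AXI controller: it assumes the Master holds AWVALID and WVALID high until the transfer is accepted, starts sending data only after the address has been accepted, uses a 4-byte incrementing burst, and that the Slave always answers with an OKAY response in the cycle after the WLAST beat. The function name and event format are hypothetical.

```python
def simulate_write_burst(addr, data, awready, wready):
    """Simulate one AXI write burst cycle by cycle.

    awready/wready: per-cycle READY values driven by the Slave (0 or 1).
    Returns (events, memory), where events lists each completed handshake.
    """
    events = []
    memory = {}
    addr_done = False
    beat = 0
    for t in range(min(len(awready), len(wready))):
        if not addr_done:
            # Address transfer happens when AWVALID (held high) meets AWREADY.
            if awready[t]:
                addr_done = True
                events.append((t, "AW", addr))
        elif beat < len(data):
            # Data transfer happens when WVALID (held high) meets WREADY.
            if wready[t]:
                wlast = beat == len(data) - 1  # WLAST marks the final beat
                memory[addr + 4 * beat] = data[beat]
                events.append((t, "W", data[beat], wlast))
                beat += 1
        else:
            # Write response: BVALID with BRESP = OKAY (assumed always OK).
            events.append((t, "B", "OKAY"))
            break
    return events, memory
```

For example, with `awready = [0, 0, 1, 1, 1, 1, 1]` and `wready = [0, 0, 0, 1, 0, 1, 1]`, the address is accepted in cycle 2, the two data beats in cycles 3 and 5, and the response arrives in cycle 6.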

AXI Write transaction AXI Read transaction

The MLite CPU

The MLite CPU is a small synthesizable 32-bit RISC microprocessor that executes all MIPS I [11] user mode instructions except unaligned load and store operations. In the VHDL code the CPU top module, which includes the cpu (mlite_cpu unit), and the AXI4 read and write controllers, is represented by the cpu unit.

Program Counter

The pc_next unit generates the address of the next instruction on its pc_future output port. The value of pc_future can either:

  • be the incremented value of the previous program counter (stored locally in pc_reg)
  • come directly from the mem_ctrl unit (on the opcode25_0 input port), in case of an unconditional branch
  • be computed by the alu unit and received on the pc_new input port, in case of a conditional branch

The selection is performed by the pc_source signal, generated by the control unit. In case of a conditional branch, the take_branch signal is also utilized in the selection: pc_future takes the pc_new value when take_branch is 1, and the previously incremented program counter value when take_branch is 0.
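The selection logic above can be summarised in a short behavioural model. This is a sketch under assumptions, not a translation of the VHDL: the `PcSource` names are hypothetical labels for the pc_source encodings, and the jump target is formed the standard MIPS way (upper four bits of the incremented PC combined with the 26-bit field shifted left by two).

```python
from enum import Enum

MASK32 = 0xFFFFFFFF


class PcSource(Enum):
    # Hypothetical labels; the real encodings are defined in the VHDL sources.
    INCREMENT = 0
    JUMP = 1
    BRANCH = 2


def pc_future(pc_reg, pc_source, opcode25_0=0, pc_new=0, take_branch=False):
    """Behavioural model of the next-PC selection."""
    pc_inc = (pc_reg + 4) & MASK32
    if pc_source is PcSource.JUMP:
        # Unconditional branch: target comes from the instruction word.
        return (pc_inc & 0xF0000000) | ((opcode25_0 & 0x03FFFFFF) << 2)
    if pc_source is PcSource.BRANCH and take_branch:
        # Conditional branch taken: target computed by the alu.
        return pc_new & MASK32
    # Sequential execution, or conditional branch not taken.
    return pc_inc
```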

Program Counter Diagram

Memory Interface

The mem_ctrl unit manages the communication between the cpu and memory: it sends addresses to memory, and both receives data from and sends data to memory.

The unit is controlled through the mem_source signal, issued by the control unit, to perform one of the following tasks:

  • Instruction fetch:
    • it sends the appropriate instruction address, received from pc_next on the address_pc port, to the memory;
    • it receives the instruction opcode on its data_r port;
    • it delivers the instruction opcode to the control unit on its opcode_out port.
  • Data memory read (for load operations):
    • it sends to the memory the address of the data to be loaded;
    • it receives the data from memory and passes it to the bus_mux unit.
  • Data memory write (for store operations):
    • it sends to the memory the data and the address where it is to be written.

Memory Controller Diagram

Decode and Control

The control unit performs instruction decode, based on which it generates the control signals for all the other units.

Control Unit Diagram

The actual logic behind which control signals are sent depends on the instruction set:

Instruction Set Architecture

Bus

The main task of the bus_mux unit is to multiplex the input signals of the functional units. In addition, the bus_mux unit also performs the comparison required by the conditional branch instructions, and it generates the branch taken/not taken signal on its take_branch port.
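As an illustration of the branch comparison, the sketch below evaluates the conditions of the MIPS I conditional branch instructions on 32-bit register values. It is a behavioural model only: the function name and the use of mnemonic strings as the `kind` argument are assumptions and do not reflect how bus_mux encodes the branch function internally.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def take_branch(kind, a, b=0):
    """Return True if a MIPS I conditional branch of the given kind is taken.

    a, b: 32-bit register values (b is only used by beq and bne).
    """
    a_signed = to_signed32(a)
    if kind == "beq":
        return (a & MASK32) == (b & MASK32)
    if kind == "bne":
        return (a & MASK32) != (b & MASK32)
    if kind == "blez":
        return a_signed <= 0
    if kind == "bgtz":
        return a_signed > 0
    if kind == "bltz":
        return a_signed < 0
    if kind == "bgez":
        return a_signed >= 0
    raise ValueError("unknown branch kind: " + kind)
```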

Bus

Register File

The MLite CPU is based on the MIPS I instruction set, hence it embeds 32 32-bit general purpose registers. From the user (compiler) perspective, each register has a specific function, detailed below. You can use the table to find the mapping between the software register name (the one present in the benchmark assembly listing) and the hardware address of the register in the register bank.

The only functions that also need support from the hardware implementation are the following:

  • The value of register R0 is always zero
  • R31 is used as the link register to return from a subroutine

In addition to the general purpose registers there are 4 special registers, also detailed below.

  • The HI and LO registers hold the upper and lower 32 bits, respectively, of a 64-bit multiplication/division result.
  • The Program Counter (PC) specifies the address of the next instruction in the program.
  • The Exception Program Counter (EPC) register remembers the program counter when there is an interrupt or exception.
  • There is no status register. Instead, the results of a comparison set a register value, and the branch then tests this register value.

| Register | Name | Function |
|----------|------|----------|
| R0 | zero | Always contains 0 |
| R1 | at | Assembler temporary |
| R2-R3 | v0-v1 | Function return value |
| R4-R7 | a0-a3 | Function parameters |
| R8-R15 | t0-t7 | Function temporary values |
| R16-R23 | s0-s7 | Saved registers across function calls |
| R24-R25 | t8-t9 | Function temporary values |
| R26-R27 | k0-k1 | Reserved for interrupt handler |
| R28 | gp | Global pointer |
| R29 | sp | Stack Pointer |
| R30 | s8 | Saved register across function calls |
| R31 | ra | Return address from function call |
| HI-LO | hi-lo | Multiplication/division results |
| PC | Program Counter | Points at 8 bytes past current instruction |
| EPC | ExceptionPC | Exception program counter return address |
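When reading a benchmark assembly listing next to simulation waveforms, a small lookup helper for the general purpose registers in the table can save time. The sketch below is a convenience script, not part of the provided tools; the function name is hypothetical.

```python
# Software register names in hardware index order (R0 .. R31).
ABI_NAMES = (
    ["zero", "at", "v0", "v1"]
    + ["a%d" % i for i in range(4)]
    + ["t%d" % i for i in range(8)]
    + ["s%d" % i for i in range(8)]
    + ["t8", "t9", "k0", "k1", "gp", "sp", "s8", "ra"]
)


def register_index(name):
    """Map a software register name such as '$sp' or 'ra' to its index.

    Numeric forms such as '$29' are accepted as well.
    """
    name = name.lstrip("$").lower()
    if name.isdigit():
        index = int(name)
        if not 0 <= index < 32:
            raise ValueError("register index out of range: " + name)
        return index
    return ABI_NAMES.index(name)  # raises ValueError for unknown names
```

For example, `register_index("$sp")` gives 29 and `ABI_NAMES[16]` gives `"s0"`.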

The interconnection of the reg_bank unit with the rest of the units is pictured below. It is implemented using two dual port memories.

Register File Diagram

Functional Units

The MLite CPU has three functional units:

  • an ALU
  • a Multiplier/Divider
  • a Shifter

The ALU (alu.vhd) executes arithmetic and logic operations, with a delay of 1 clock cycle. The adder in the ALU is described in behavioral VHDL as a ripple-carry adder.
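To make the ripple-carry structure concrete, the following bit-level sketch chains one full adder per bit, so that each carry depends on the previous one. It is a generic model for illustration and is not derived from alu.vhd; note that the synthesis tool may map a behavioural adder onto dedicated carry logic on the FPGA.

```python
def ripple_carry_add(a, b, carry_in=0, width=32):
    """Add two width-bit values with a chain of full adders.

    Returns (sum, carry_out). The carry ripples from bit 0 upwards, which
    is why the delay of this structure grows linearly with the width.
    """
    result = 0
    carry = carry_in & 1
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        s = x ^ y ^ carry                    # full adder sum bit
        carry = (x & y) | (carry & (x ^ y))  # full adder carry out
        result |= s << i
    return result, carry
```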

The serial Multiplier/Divider (mult.vhd) takes 32 cycles to compute the 64-bit multiplication result, or the 32-bit quotient and the 32-bit remainder. The pipeline is stalled during the mul/div operation by asserting the pause_out signal. The Shifter (shifter.vhd) performs left and right bit-shifting in 1 clock cycle.
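The reason a serial unit needs 32 cycles is that it handles one bit per cycle. The sketch below shows the textbook shift-and-add multiplier and restoring divider for unsigned operands, with one loop iteration standing for one clock cycle. It is a generic model under that assumption and is not derived from mult.vhd, which may differ in detail (for example in how signed operands are handled).

```python
MASK32 = 0xFFFFFFFF


def serial_multiply(a, b):
    """Unsigned 32x32 shift-and-add multiply. Returns (hi, lo)."""
    a &= MASK32
    product = b & MASK32  # lower half initially holds the multiplier
    for _ in range(32):   # one iteration per clock cycle
        if product & 1:
            product += a << 32  # add multiplicand into the upper half
        product >>= 1
    return (product >> 32) & MASK32, product & MASK32


def serial_divide(dividend, divisor):
    """Unsigned 32-bit restoring division. Returns (quotient, remainder)."""
    divisor &= MASK32
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    remainder = 0
    quotient = dividend & MASK32
    for _ in range(32):   # one iteration per clock cycle
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | (quotient >> 31)
        quotient = (quotient << 1) & MASK32
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

Processing several bits per cycle, or replacing the serial scheme with a parallel multiplier, trades area for a shorter stall.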

Pipeline

The pipeline unit contains the pipeline registers (flip-flops) that delay the inputs of the functional units and of the register file write port. Other separation registers between pipeline stages are placed in their corresponding modules.

Frequently Asked Questions

Can somebody take a look at my design? I have some errors that I do not manage to solve.

No, we cannot debug your VHDL programs for you. There are weekly labs in which you can ask questions about technical issues you're facing. However, we generally expect you to debug your own code. We recommend using version control software to keep track of old versions of your code, so that you can go back to them.

Can I have a discussion with you about the design? I have some ideas to improve it.

Yes. There are weekly labs, and there is an intermediate milestone meeting scheduled on May 9th or 10th in which those things can be discussed. Before and after that, you may email your questions to j.b.doenszelmann@tudelft.nl and/or S.D.Cotofana@tudelft.nl.

One-time path updating for Vitis the first time you launch the tool.

Right click on design_2_wrapper in the Explorer pane, select Build Project. Then right click on design_2_wrapper in the Explorer pane, select Update Hardware Specification, browse and select the .xsa file that you just exported from Vivado (folder fpga/zynq_fpga/workspace).

ERROR in Vitis: Cannot find -lxil.

Right click on the appARMcpu in the Explorer pane, Properties, C/C++ General, Paths and Symbols, Library Paths, and add the path of libxil.a

Simulation gets stuck at the Execute Simulation step.

Add the pdp folder to the scanning exception list of the antivirus software installed on your computer.