Processor Design Project
This is the homepage for the Processor Design Project. You can find most course information here.
This course is part of the CESE masters programme. This course is a continuation of the curriculum of Computer Arithmetic.
Notice of Changes
This course will be SIGNIFICANTLY changed this year, and the information on this page is no longer correct. It is currently being updated.
Project Overview
- You do this project in groups of 3
- You can choose your own groups
- Once you have chosen a group:
  - Join a Brightspace group
  - Send an email to j.b.doenszelmann@tudelft.nl with three names and the email addresses linked to your Dropbox accounts. This will give you access to a git repository and the Dropbox folder in which you can upload your benchmarks
In this course, you will be tasked with improving the performance of a MIPS-based CPU in groups of 3. To evaluate your new design, you are given a set of benchmarks from the GMPbench and MiBench suites, and remote access to Zynq-7000 FPGA boards. To do so, you will receive:
- A git repository which includes the MIPS processor VHDL code and the benchmarks, the necessary files required for simulation and FPGA implementation, as well as the MIPS cross-compiler
- This website, detailing the project assignment and the simulation general flow and toolset.
Schedule
Date | Activity |
---|---|
April 22nd in EWI/EEMCS Lecture Hall D@ta | Kickoff Meeting |
Every Thursday (except public holidays) | Q&A Lab in EWI Hall M from 13:45 to 17:45 |
May 8th and 9th | Midterm Milestone Meeting |
June 23rd | Report Submission |
TBD. | Project Presentation |
Getting Started
- Find a Group and let us know about it!
- Carefully read Project Workflow. It contains information on how to do the project.
It has chapters for:
- The FPGA Board Description
- Vivado, the software you need to install to follow this course
- How to test your design
- If you have questions on how to set up Vivado, or need other support, come to the weekly labs!
- Platform description describes the MIPS processor you will be modifying in this project.
- In Frequently Asked Questions we are collecting some frequent questions. Please contribute in the labs!
Here you can find the baseline scores for all benchmarks.
Grading Procedure
The project's functionality will be verified, and the project will be checked for plagiarism (including between groups).
If the project is not functional you DO NOT pass the course. Plagiarism can also make you fail.
The final score for the project is determined based on the following criteria:
- Design Performance (DP) - The benchmark score is quite important at this point, the higher the better, but this is not the only relevant aspect. We also take into consideration the other metrics, with emphasis on the compound ones which give insight into the effectiveness of your proposal, e.g., area overhead vs achieved improvement.
- Technical Merit (TM) - Aspects as innovation level and implementation quality are considered.
- Report (R) - Report organization, content, and language are important aspects at this point.
- Presentation (P) - Presentation organization, slides, and the live performance are the main metrics.
The CESE4040 final grade is computed as:
Staff
The course will be led by Sorin Cotofana, with some assistance of Jonathan Dönszelmann.
Additionally, some TAs will help out: *
Contact
- Jonathan at j.b.doenszelmann@tudelft.nl
- Sorin at S.D.Cotofana@tudelft.nl
Software Setup
Milestone Meeting
The milestone meeting is in person and per group, and it is meant to be a midterm status check. It is NOT optional!
For the milestone meeting you must:
- Get familiarized with the project set-up and the FPGA server.
- Carry on the performance evaluation of the base-line processor design on the FPGA board with the provided benchmark suite.
- Make use of Vivado synthesis and implementation options to optimize the base-line performance without operating any changes on the SoC design (except increasing the clock frequency if the slack allows for it) and redo the FPGA evaluation.
- Perform a Cache Design Space Exploration (DSE). The baseline design includes a 2KB direct mapped cache, which only maps the first 2MB of the main memory. However, the cache design is parametric, so during the DSE process you may vary the cache size, the amount of main memory mapped in the cache, the cache line width, and the cache associativity. Moreover, you can also make other changes to the cache, e.g., change the replacement policy in case of associative caches. During the DSE you need to evaluate the clock frequency, execution cycles, area, and energy consumption of all the considered cache configurations and determine the most effective one, according to your own design goal.
- Identify at least two more possible improvement avenues.
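Before changing any VHDL, it can help to tabulate the address breakdown of each cache configuration you plan to explore. The sketch below is only an illustration of that bookkeeping: the function names, the 16- and 32-byte line widths, and the candidate sizes are assumptions and are not taken from the provided design. It assumes power-of-two sizes and that the tag only needs to distinguish addresses inside the mapped region (2MB in the baseline).

```python
from itertools import product


def log2_exact(n: int) -> int:
    """Return log2(n) for a positive power of two, raise otherwise."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def cache_geometry(size_bytes: int, line_bytes: int, ways: int, mapped_bytes: int) -> dict:
    """Address breakdown of a set-associative cache (ways=1 means direct mapped)."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = log2_exact(line_bytes)
    index_bits = log2_exact(sets)
    # Only addresses inside the mapped region must be told apart by the tag.
    tag_bits = log2_exact(mapped_bytes) - index_bits - offset_bits
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    # Hypothetical sweep; 2KB, direct mapped, 2MB mapped is the baseline point.
    for size, line, ways in product((2 * KB, 4 * KB, 8 * KB), (16, 32), (1, 2)):
        print(size, line, ways, cache_geometry(size, line, ways, 2 * MB))
```

Each row of such a table still has to be synthesized and run on the board to obtain the clock frequency, cycle count, area, and energy figures the DSE asks for.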
For the milestone meeting you are required to present a short progress report that:
- Provides an overview of the Vivado optimization settings and comments on their impact on platform performance.
- Describes and comments on the impact of memory hierarchy changes on the platform performance.
- Presents the improvement approach that you decided to follow (justify your choices), the expected performance improvement, and the work status (achievements to date).
Project Submission
The project submission must be done via e-mail.
In your final submission, you should include:
- The report, preferably a PDF (naming convention: CESE4040_2023_report_g#, with # being the group number), in the root of your git repository. See Reporting
- Your git repository (ONLY the main branch counts)
The report should include the following:
- The general idea behind your overall optimization proposal. Related to the parts that you decided to change or to append you must answer (at least) these questions: What, how, and why?
- The design of the improved and/or appended part(s). There is no need to go down to the gate level! The basic algorithm you decided to implement and the RTL designs are in general enough. You may go to full adder/gate level for specific parts if this is essential for your proposal.
- Performance results for the baseline design (the original MIPS system) and your improved core. Those should include the following: Detailed implementation results including timing information, critical path information, resource utilization information from Vivado report, and power consumption figures.
- A comparison between your design and the baseline design in terms of basic metrics, i.e., Area (A), critical path Delay (deduced from clock frequency) (D), Benchmarks Score in terms of execution clock cycles (BS), maximum Power (P), total Energy consumption (E), as well as compound metrics, e.g., , , and products.
- Comments on the obtained results, e.g., answer to “Which improvement is mostly contributing to each figure of merit?”.
- Conclusions.
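The comparison between baseline and improved design can be organized as a small script. The sketch below is an assumption about how one might do this bookkeeping: the class and function names are hypothetical, the numbers in the demo are made up, and the two products shown (area x delay and energy x delay) are common choices in the literature rather than the specific compound metrics required by the course.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    area: float      # A, e.g. in CLB equivalents
    delay_ns: float  # D, critical path delay deduced from the clock frequency
    cycles: float    # BS, benchmark execution cycles
    energy_j: float  # E, total energy consumption


def ratio(baseline: float, improved: float) -> float:
    """Improvement factor for a lower-is-better metric; > 1 means improved."""
    return baseline / improved


def compare(base: Metrics, new: Metrics) -> dict:
    """Improvement factors for basic metrics and two example products."""
    return {
        "area": ratio(base.area, new.area),
        "delay": ratio(base.delay_ns, new.delay_ns),
        "cycles": ratio(base.cycles, new.cycles),
        "energy": ratio(base.energy_j, new.energy_j),
        "area_x_delay": ratio(base.area * base.delay_ns, new.area * new.delay_ns),
        "energy_x_delay": ratio(base.energy_j * base.delay_ns, new.energy_j * new.delay_ns),
    }


if __name__ == "__main__":
    # Made-up placeholder numbers, for illustration only.
    baseline = Metrics(area=100.0, delay_ns=10.0, cycles=1000.0, energy_j=2.0)
    improved = Metrics(area=200.0, delay_ns=10.0, cycles=500.0, energy_j=1.0)
    for name, factor in compare(baseline, improved).items():
        print(f"{name}: {factor:.2f}x")
```

Reporting factors rather than raw values makes the trade-off visible at a glance, for example an area factor below 1 next to a cycle factor above 1.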
The report may also include suggestions for this project, although this is not mandatory. Your feedback is very much appreciated. If you want to give it anonymously, you're also invited to fill in the Evasys questionnaire after the course is over.
Methodology
The first question we need to address is “What do we need in order to improve the system?” Starting from one of your targets, e.g., to improve the computational performance of an arithmetic core, you need at least an Electronic Design Automation (EDA) tool to describe your design in VHDL. An EDA tool is not just a text editor for VHDL (even though most of the time you’ll be struggling with VHDL): it also provides the means to translate your VHDL description into "real" FPGA hardware.
Then you need to evaluate your design, to check its correctness and efficiency, so you need a way to simulate your design and to map it into real hardware. Moreover you need benchmarks to evaluate its correctness and performance.
Since an arithmetic core cannot run on its own, you need a supporting platform to embed it in, and on this platform you can execute the benchmark programs. This platform should be easily implementable in the same way in which your designs are going to be implemented. Concerning the benchmarks, you can develop them by yourself to verify your design, or you can use standard generally acceptable sets to evaluate your design and then to compare it with equivalent counterparts. The benchmark programs are usually written in a programming language with little runtime overhead, like C or Rust for example.
Generally speaking, the system on which the implementation tools and the benchmark compilation run has a different instruction set than the general-purpose processor for which you develop your project. Thus you need a dedicated compiler (a "cross compiler", as the target platform differs from the host platform) to compile the benchmark programs for the supporting platform. Once you have the EDA tools and the supporting processor, have implemented the required hardware, and have generated the benchmark binaries via compilation, you can start evaluating your design.
In the best scenario you can simulate your design and synthesize it without any errors or bugs. In that case you can carry on with the evaluation, and if the obtained performance is satisfactory you can start writing your report. However, this is quite unlikely to happen on the first attempt. Most of the time, you will get fatal errors at compilation time, soft errors at execution time, or some strange errors whose origin you don’t have a clue about. In any of these cases, you need debugging tools for hardware, software, or both.
FPGA work
Work with an actual FPGA is done remotely. To do so, you should make sure we have your email address that is linked to a Dropbox account. More information can be found here.
Project Workflow
To design, debug, and evaluate your improved processor you have to synthesize it on an FPGA board. In the project you will follow these steps:
- The baseline design (the reference design that you will receive) should be analyzed and its weak point(s) determined by means of, e.g., profiling.
- Avenues for improvement should be identified.
- For each processor component you want to add or change, you should make a pen-and-paper plan capturing its input and output ports and RTL design.
- Write the corresponding HDL code for that component and test whether it performs its function as intended or not (behavioral RTL simulation).
- Only then can you test the design on the FPGA. For that, the following steps need to be taken:
  - Logic synthesis, converting the RTL constructs from the HDL code into a design implementation in terms of generic resources (LUTs, MUXes, memory, flip-flops, adders, etc.).
  - Map the design generic resources into the specific resources available on the Zynq FPGA (e.g., slices, IOBs)
  - "Place and Route", wiring components together, such that the imposed design criteria (such as pin constraints for component ports and timing constraints) are met.
  - Load the resulting "bit file" onto the FPGA, to test.
Behavioral testing
Generally speaking, it is advisable to write an HDL testbench and test each component individually before integrating it into the system, a bit like unit tests in software.
The alternative would be to opt for testing the entire system, maybe through the C programs we provide. Though possible, it will be harder to debug your design that way.
To evaluate your processor, a set of benchmarks will be executed. A software application that will initialize the board, send the benchmark to the DDR memory, and finally collect the benchmark results from the board and further display them is also required.
This application can be executed on your computer, to test in the provided emulator. However, once you have uploaded the bitstream to the FPGA board, the same code can be executed there, since the board also incorporates an ARM processor. The system timing performance, area, and energy consumption can then be evaluated and appropriate further actions undertaken.
While doing this project, we strongly recommend that you do the following:
- Make a clear plan for your design before you start writing VHDL code.
- Run a module level simulation every time you modify something in your VHDL code, before you start running a system level simulation.
- Run a system simulation every time you change your design before you synthesize it!
- Put your source code in git, so that you can restore old versions.
To evaluate your enhanced MIPS processor, a Pynq-Z1 board (reference manual) will be used.
On this board, there's a Zynq-7000 SoC chip (reference manual), which is divided into two distinct subsystems:
- The Programmable Logic (PL), which includes an array of programmable logic blocks and a set of dedicated resources (e.g., block RAM memories for dense memory requirements, DSPs for high speed arithmetic, analog-to-digital converters)
- The Processing System (PS), whose key components include:
- hard silicon ARM Cortex-A9 processors
- clock generators
- DDR and memory controllers
- I/O peripherals (e.g., GPIO, UART)
- PS-PL connectivity interfaces.
Vivado installation
The software we use for simulation and to synthesize a design for the FPGA is Xilinx Vivado. Installers are available for Linux, Mac and Windows. On Linux, we recommend against trying to install Vivado through your package manager, since we require a specific version of the software. You can download the installer at the following address: https://www.xilinx.com/support/download. Make sure you get version 2023.2 ("Vivado Design Suite - HLx 2023.2") (though 2022.2 likely also works). Before downloading, you may need to create an account. During installation, select Vitis as the edition to be installed, with the following components:
The installation requires between 50GB and 100GB, and doesn’t need a license after installation to be used.
Cable drivers
Vivado might complain about missing cable drivers on Linux. To install them:
cd <vivado_install_dir>/data/xicom/cable_drivers/lin64/install_script/install_drivers/
sudo ./install_drivers
Read more at https://docs.amd.com/r/en-US/ug973-vivado-release-notes-install-license/Installing-Cable-Drivers
Functional Verification
All the VHDL code (except the benchmark files) is located in the rtl folder.
The sim folder contains all files required for simulation,
and the fpga folder contains all files needed for FPGA implementation.
To verify the correctness of your HDL design, you should run both module level and system level simulation. To perform module level verification, you have to write a testbench yourself. That testbench can then test a single module of your design.
System level verification
For system level verification, we have already provided you with a top-level testbench.
You can find it in sim/testbench_vivado_1.vhd. To run it, follow these steps:
- Open the Vivado project, in Vivado, not Vitis. You can find it in sim/_zynq_sim/synq_sim.xpr.
- Find the Flow Navigator pane, it's left of the main window.
- Select IP INTEGRATOR → Open Block Design. This will open the block diagram of your HDL design, showing you the address map of its components.
  - If you double click on a component here, you can change its parameters. However, be careful if you do that, because if you change a parameter that corresponds to a value in VHDL, you must also change that value in VHDL, otherwise it will not actually be updated.
- Select a program to test with. By default this is benchmarks/opcodes/opcodes.asm. However, you can change which one it uses in rtl/platform/bram.vhd. For example, you can set it to run pi, which is the name of another benchmark.
  - The benchmarks are programs written in C, which can be compiled for our simulated MIPS processor, generating a vhd file with the MIPS instructions in RAM. Check benchmarks/README.md for further instructions and sim/main_pack_opcodes.vhd for an example vhd.
  - The simulator outputs over UART, which by default is saved in sim/zynq_sim/zynq_sim.sim/sim_1/behav/xsim/uart_output.txt. However, if you'd like to change this you can do so by changing the Log File parameter of the uart component.
- Start the simulation
  - Go back to the Flow Navigator pane
  - Select SIMULATION → Run Simulation → Run Behavioral Simulation.
At this point, you can select which signals you'd like to inspect in the waveform output of your design. In the image below, you can see how to do this.
Now, you can either run the simulation until manually interrupted (the Run All button) or set the simulation time in the textbox and press the smaller arrow beside it. To get an idea of the timescales we're working with, the main_pack_opcodes benchmark takes around 35ms.
NOTE: Every time you make a change in the RTL code, the simulation should be relaunched.
For more information, this manual, and this tutorial might help you.
Once your design passes system level verification, you can continue to the next step: Synthesizing your design. However, if the tests don't pass, you will need to revisit your HDL code and redo the steps in this chapter.
Custom Tests
The assembly code corresponding to a test of all possible ISA opcodes is provided in benchmarks/opcodes/opcodes.asm.
Although it can help you debug your design, you might want to make some finer grained tests than that.
For that reason, we also include a cross-compiler and assembler for the MIPS architecture you're working on.
To make your own benchmark written in C, you can change the C code in benchmarks/custom.
To run these, execute
cd benchmarks/custom
make clean image
A corresponding .vhd file will be generated in the same folder.
Modifying the ISA
This section is incomplete. Older manuals did have it, but the information got outdated. If you want to do this, send an email to S.D.Cotofana@tudelft.nl, also to get the compiler source code files.
If you choose to modify the instruction set of the processor you're building, you will also need to change the compiler for it. Otherwise, the new instructions you add will not be used by the compiler, or if you remove instructions, the compiler will still happily generate them.
To do this, we provide the source of the MIPS cross compiler: mips-pdp-elf-....,
as well as tools to build this cross compiler itself (crosstool-NG).
The required compiler modifications should be made in the form of patches which should be placed in the folder toolchain-ctng/ctng-bin/lib/crosstool-ng-1.22.0/patches/my_patches.
The source packages used to build the cross-compiler are available in the folder toolchain-ctng/ctng_toolchain_src. It is advisable not to overwrite the original compiler, but to run
ct-ng menuconfig
Synthesizing your Design
To synthesize and implement your design, open the Vivado project file (zynq_fpga.xpr) in the folder fpga/zynq_fpga.
The Flow Navigator pane (left of the window) contains all the available FPGA implementation flow steps.
Using this pane works in a number of steps:
- functional verification
- updating the block diagram to reflect the parameters settings from the simulation block diagram and validating this
- generating the bitstream
This is accomplished from the Flow Navigator pane, either step by step, which allows you to analyze the results of each step:
SYNTHESIS → Run Synthesis;
IMPLEMENTATION → Run Implementation;
PROGRAM AND DEBUG → Generate Bitstream
or directly by clicking PROGRAM AND DEBUG → Generate Bitstream,
which will automatically execute all the previously mentioned steps.
Upon successful completion, the bitstream file (.bit) will be created.
Normally, the block diagram output products necessary for synthesis are automatically re-generated every time. However, if for some reason the older files are used, the output products can be forcefully re-generated prior to running synthesis by selecting IP INTEGRATOR → Generate Block Design in the Flow Navigator pane
and then selecting Global as the Synthesis Options.
Vivado FPGA Implementation Strategies
Depending on the algorithms and overall strategies selected within the Vivado tool synthesis and implementation menu, slightly different FPGA implementations (with different area utilization and timing performance) can be obtained.
A strategy is a set of Vivado tool settings,
which specify the design flows and the optimization levels.
Vivado provides a set of predefined strategies for synthesis and implementation.
Alternatively, you can also create your own strategies.
Vivado's Synthesis & Implementation settings, displayed below,
can be accessed from the Flow Navigator pane via Settings.
While synthesis consists of just a single subprocess,
the implementation step has multiple such subprocesses,
each consisting of a series of steps which can be optimized with a specified effort level (-directive setting).
Identifying effective strategies can lead to best results based on specified design goals.
For example, you can use this when you have a performance oriented goal.
By changing the Synthesis and Implementation strategy,
minor timing closure violations (e.g., small negative slack) can be resolved.
This can even lead to a higher maximum clock frequency for the design.
For further details about the Vivado tool Synthesis and Implementation strategies, please refer to the
Vivado Design Suite User Guide - Synthesis and
Vivado Design Suite User Guide - Implementation.
Processor Evaluation
Once you obtain a bug free design you may proceed to its evaluation. Some information can be found in the reports of vivado, such as the maximum operating frequency and the area of your design. Other results can only be found by executing benchmarks on the actual FPGA.
Area Evaluation
Below, an example of a design floorplan is given. This is the design floorplan of the baseline CPU we gave you, with the Programmable Logic (PL) and the Processing System (PS) sections outlined.
The main resources for combinatorial and sequential circuits are Configurable Logic Blocks (CLBs). Each CLB consists of 2 interconnected slices, as shown in the example above, which contain the following resources:
- 8 look-up tables for random logic implementation or distributed memory,
- 6 multiplexers,
- 2 fast carry chains,
- 16 Flip-Flops out of which 8 can be configured as latches.
For memory implementation, besides the distributed RAM (from CLBs),
a number of 36Kb RAM Blocks (RAMBs) are available
(see Vivado Design Suite 7 Series FPGA and Zynq-7000 SoC Libraries Guide for BRAMs instantiation primitives).
To estimate the area of your design,
a compound value should be derived from the implemented design's Utilization Report.
You can generate this for the cpu_0 component through Vivado's IMPLEMENTATION → Open Implemented Design → Report Utilization.
Below we've shown a sample report of the baseline cpu:
You can then use this information to estimate the total area using the following relations:
- The area of a slice is half the area of a CLB
- The area of a RAMB18 (18 kbit configurable RAM block) is 1.2 times the size of a CLB
- The area of a RAMB36 (36 kbit configurable RAM block) is 2.4 times the size of a CLB
In the example from above, cpu_0 doesn't use any block RAM tiles, but 991 slices, so applying the relations above the total area is 991 × 0.5 = 495.5 CLBs.
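The relations above can be wrapped in a small helper so that every configuration is scored the same way. This is a minimal sketch: the function and constant names are made up for illustration, the result is expressed in CLB equivalents, and the counts have to be read by hand from the Utilization Report.

```python
# Relative areas from the relations above, expressed in CLB equivalents.
SLICE_IN_CLB = 0.5
RAMB18_IN_CLB = 1.2
RAMB36_IN_CLB = 2.4


def area_in_clbs(slices: int, ramb18: int = 0, ramb36: int = 0) -> float:
    """Compound area estimate from the counts in Vivado's Utilization Report."""
    return (slices * SLICE_IN_CLB
            + ramb18 * RAMB18_IN_CLB
            + ramb36 * RAMB36_IN_CLB)


if __name__ == "__main__":
    # Baseline cpu_0 from the sample report: 991 slices, no block RAM tiles.
    print(area_in_clbs(991))
```

A design that trades slices for block RAM (for example a larger cache) can then be compared with the baseline on a single number.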
Timing Analysis
To check that the implemented design works at the requested frequency,
a timing report should be generated in Vivado through IMPLEMENTATION → Open Implemented Design → Report Timing Summary.
A timing tab will open up detailing all the timing constraints
(usually one for each clock signal in the design)
with the top slowest propagation paths and any potential timing violations.
Below, we've shown the timing summary section of the baseline cpu:
This shows us whether all the timing constraints are met (slack values should be positive to meet timing) and thus that the design will be able to run in a reliable manner at the requested frequency. If the design does not meet the desired frequency, it is recommended to focus on the Worst Negative Slack (WNS) as the main way to improve the total negative slack.
If the slack is positive, the design is guaranteed to work reliably at the requested frequency. However, if the slack is negative the design might work within reasonable negative slack bounds, or it might not work properly at all. Based on the WNS value, we can also estimate the maximum attainable operation frequency as follows:
This FMAX estimation could be pessimistic, as Vivado does not try to further improve timing after the timing constraints are met.
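A commonly used estimate takes the minimum achievable clock period to be the requested period minus the WNS, so that FMAX = 1 / (T - WNS). The sketch below applies that relation; it is an assumption that this matches the formula intended on this page, and the function name and the example periods are hypothetical.

```python
def fmax_mhz(clock_period_ns: float, wns_ns: float) -> float:
    """Estimate FMAX in MHz from the requested clock period and the WNS.

    Positive slack shortens the achievable period, negative slack lengthens it.
    """
    min_period_ns = clock_period_ns - wns_ns
    if min_period_ns <= 0:
        raise ValueError("WNS must be smaller than the clock period")
    return 1000.0 / min_period_ns


if __name__ == "__main__":
    # Hypothetical numbers: a 20 ns constraint met with 4 ns of positive slack.
    print(fmax_mhz(20.0, 4.0))
    # Hypothetical numbers: a 10 ns constraint missed by 2.5 ns.
    print(fmax_mhz(10.0, -2.5))
```

Because Vivado stops optimizing once the constraints are met, tightening the clock constraint and re-running implementation is the way to check whether a higher frequency is actually reachable.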
Clicking on the value of WNS will display information for the top 10 paths with the worst delay in the design. Double clicking on the critical path (Path 1) will display detailed timing related information, among which the path start point and end point, the number of levels of logic, the logical resources included in the path, and the slack with respect to the requested frequency, as shown below.
Right-clicking on Path 1, and selecting Schematic will display the critical path schematic, as illustrated.
Further information on how to read a timing report can be found in the Vivado documentation.
Executing Benchmarks
To evaluate your improvements of the cpu, you are given a set of benchmarks to be run on the FPGA board. After benchmark execution on the board the message "CORRECT!" is displayed if the benchmark results are as expected. Otherwise, an "ERROR" message is displayed.
In addition, to evaluate the performance of your solution, the number of CPU cycles (in million cycles) consumed by the benchmark execution, the peak power consumption (in mW), and the energy (in J) are measured and displayed.
The scores of the baseline cpu for these benchmarks are shown in the table below:
Baseline benchmark scores
Name | Description | Cycles (million) | Time (s) | Energy (J) | Peak power (mW) |
---|---|---|---|---|---|
opcodes | Tests all MIPS I instructions (only for testing) | - | - | - | - |
cjpeg | JPEG compression | 100.018590 | 1.6952 | 1.722011 | 79.197090 |
divide | Large number (192Kb) integer division using GMP library | 707.854012 | 11.9975 | 2.467115 | 78.970367 |
multiply | Large number (64Kb) integer multiplication using GMP library | 390.739735 | 6.6227 | 2.187528 | 81.413025 |
pi | Computes 1000 digits of PI using basic arithmetic operations | 884.506178 | 14.9916 | 2.784943 | 81.311775 |
fir | Length-63 bandpass FIR filter applied to 50000 input samples | 213.740789 | 3.6227 | 1.696708 | 72.882149 |
rsa | RSA message signing using GMP library | 1420.508182 | 24.0764 | 3.230663 | 79.212349 |
ssd | Pattern matching using Sum-of-Squared-Differences | 2305.992254 | 39.0846 | 4.397920 | 82.596886 |
ssearch | String search using look-up tables | 1419.875462 | 24.0657 | 3.343137 | 81.810158 |
susan | Gaussian image smoothing | 1601.722838 | 27.1478 | 3.540810 | 82.110268 |
benchall | All Benchmarks in one run | 8654.753342 | 146.6907 | 11.774730 | 81.538254 |
The C and assembly source code for each benchmark can be found in the benchmarks folder. Note that if your cpu performance improvement strategy relies (also) on MIPS ISA modification/augmentation, then you need to recompile the benchmarks before running them.
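The cycle and time columns of the baseline table are related through the clock frequency, which is useful when predicting the runtime of a design that changes both the cycle count and the clock. The sketch below uses four rows copied from the table; dividing cycles by time gives roughly 59 MHz for each of them. The function names are made up for illustration.

```python
# (million cycles, seconds) copied from the baseline table above.
BASELINE = {
    "cjpeg": (100.018590, 1.6952),
    "divide": (707.854012, 11.9975),
    "multiply": (390.739735, 6.6227),
    "pi": (884.506178, 14.9916),
}


def effective_mhz(mcycles: float, seconds: float) -> float:
    """Clock frequency implied by a cycle count and a wall-clock time."""
    return mcycles / seconds


def runtime_s(mcycles: float, clock_mhz: float) -> float:
    """Expected runtime for a cycle count (in millions) at a given clock."""
    return mcycles / clock_mhz


if __name__ == "__main__":
    for name, (mcycles, seconds) in BASELINE.items():
        print(f"{name}: {effective_mhz(mcycles, seconds):.3f} MHz")
```

With these two helpers, a hypothetical design that needs fewer cycles but closes timing at a lower clock can be checked for a net gain before spending a board run on it.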
In order to execute the benchmarks on your modified cpu, 2 things are needed:
- the bitstream file (.bit) (generated in Vivado) which is to be uploaded to the board and used to configure the FPGA
- the benchmarks .bin file.
These two files are stored on the Zynq's SD card. To manage the benchmark execution, the ARM CPU on the Zynq board is used. Specifically, the ARM CPU tasks are: initialize the board interfaces, transfer the benchmark from the SD card into the DDR memory, start the benchmark execution on the processor, receive the benchmark results from the processor, and redirect them to an external file. The ARM CPU tasks are performed according to a C program created in Vitis.
After the bitstream generation has completed successfully in Vivado,
you can use the workflow shown below to create a wrapper bitstream called design_2_wrapper.bit.
The steps below line up with the ones highlighted in the picture below.
- Create a compound file design_2_wrapper.xsa, which should be in fpga/zynq_fpga. This file contains the bitstream and other files necessary for the initialization of components like clocks, DDR and GPIO.
- You can skip this step when you don't have physical access to the Zynq, as is the case during this course.
- This one as well.
- Launch Vitis and make it aware of the changed hardware specification design_2_wrapper.xsa.
- Next you should change a line in the code that will run on the ARM CPU next to the FPGA (source code in appARMcpu_main.c). You should set the name of the benchmark you want to run. This C program will:
  - read the target benchmark that will be evaluated on your processor from the board SD card and transfer it to the PS DDR memory
  - initialize the PS components (clocks, DDR, GPIO), provide the PL clock and reset the PL logic
  - receive the benchmark results from the UART and redirect them to an external file.
- Compile this program through Vitis to get an executable: appARMcpu.elf. Of course you can also skip step 6b since you don't have a board physically connected to your laptop.
You're now done with Vitis, and you can close it.
Running Benchmarks Remotely
The communication of input/output files for remote FPGA access is performed via Dropbox shared folders.
Preliminary Dropbox folder setup:
For each group, a shared folder will be created on dropbox. In order to access the group folder, you are required to have Dropbox (www.dropbox.com) account(s) and provide us with the email addresses linked to your Dropbox accounts. Please send an email to j.b.doenszelmann@tudelft.nl with your group's names and email addresses which are linked to your dropbox accounts. A shared dropbox folder can then be accessed for upload/download either locally (if the Dropbox client is locally installed), or online via the Dropbox website.
The remote access flow for verifying the FPGA implemented design consists of the following steps:
- Place the design_2_wrapper.xsa (typically located in the fpga/zynq_fpga folder) and appARMcpu.elf (located in the fpga/zynq_fpga/workspace/appARMcpu/Debug folder) files in the Dropbox folder. NOTE: Only files with these names are accepted.
- The FPGA will be programmed automatically with the bitstream design_2_wrapper.bit, and the ARM CPU prepared for executing the appARMcpu.elf application. Afterwards, these files will be automatically deleted from the Dropbox folder. NOTE: There are multiple FPGA boards. Bitstreams are scheduled onto the boards following a round robin policy across the groups' Dropbox folders, taking the timestamps of the individual bitstream files into account.
- For each programmed bitstream, two output files will be generated in the Dropbox folder:
  - results.txt file with the benchmark execution related results (e.g., status of the benchmark results correctness, performance)
  - log.txt file consisting of ERRORS/INFO/WARNINGS concerning the status of the entire simulation (e.g., files consistency, FPGA programming, UART receive/transmit).
In both .txt files, "DONE" marks a successful end of communication with the remote FPGA board.
Warning:
If you want to preserve the .txt files, you have to save them in a separate folder, as the next evaluation will overwrite them.
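Since every evaluation overwrites the two output files, it is convenient to copy them away as soon as a run has finished. The sketch below is one possible way to do that on your own machine: the function names, folder layout, and run label are assumptions, and the only thing it relies on from this page is that a finished run contains "DONE" in results.txt.

```python
import shutil
import tempfile
from pathlib import Path


def run_finished(results_file: Path) -> bool:
    """True once results.txt exists and contains the DONE marker."""
    return results_file.exists() and "DONE" in results_file.read_text(errors="replace")


def archive_outputs(dropbox_dir: Path, archive_root: Path, label: str) -> Path:
    """Copy results.txt and log.txt into archive_root/label and return that path.

    Fails if the label was already used, so earlier runs are never overwritten.
    """
    dest = archive_root / label
    dest.mkdir(parents=True, exist_ok=False)
    for name in ("results.txt", "log.txt"):
        src = dropbox_dir / name
        if src.exists():
            shutil.copy2(src, dest / name)
    return dest


if __name__ == "__main__":
    # Self-contained demo using a temporary folder instead of a real Dropbox folder.
    with tempfile.TemporaryDirectory() as tmp:
        box = Path(tmp) / "dropbox"
        box.mkdir()
        (box / "results.txt").write_text("CORRECT!\nDONE\n")
        (box / "log.txt").write_text("INFO\nDONE\n")
        if run_finished(box / "results.txt"):
            print(archive_outputs(box, Path(tmp) / "archive", "baseline_cjpeg"))
```

Using a descriptive label per run (design variant plus benchmark) keeps the archived results easy to match with the tables in the report.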
Custom Benchmarks
For testing a custom benchmark three files have to be placed in the Dropbox folder:
- design_2_wrapper.xsa,
- appARMcpu.elf (after being modified in Vitis to update the benchmark name to custom.bin),
- and a file called custom.bin (which is generated using the provided MIPS cross-compiler).
Energy Evaluation
The Vivado tool can provide a power consumption estimate of your design,
obtained through IMPLEMENTATION → Open Implemented Design → Report Power.
To derive the dynamic power consumption estimate,
the Vivado tool, by default,
does not require the user to specify any information related to the switching profile of the design nets (default switching rates are being assumed).
The confidence level of this estimate however is relatively low,
and in order to perform a more accurate power analysis,
a profile with the actual implemented design signals switching activity should be provided.
Such a switching profile can be obtained by simulating the implemented design in QuestaSim with a benchmark and logging the value changes, together with their timestamps, for the relevant design signals.
Such a process is very time consuming and the estimates can still suffer from accuracy issues when compared to the actual power consumed when running on the board.
Thus, instead, we will monitor and measure the power rails of the PL fabric while running the benchmark on the board.
Based on these measurements the energy consumption and peak power are derived.
These values are part of the results displayed in the results.txt file after a successful benchmark completion.
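To make the relation between sampled power, energy, and peak power concrete, here is a small sketch. It assumes the measurement is available as a list of timestamps and power samples and integrates them with the trapezoidal rule; how the course infrastructure actually processes the rail measurements is not described here, so the function and the sample values are purely illustrative.

```python
def energy_and_peak(timestamps_s, power_w):
    """Return (energy in joules, peak power in watts) from power samples.

    Energy is the time integral of power, approximated here with the
    trapezoidal rule. Timestamps are in seconds, power in watts.
    """
    if len(timestamps_s) != len(power_w) or len(power_w) < 2:
        raise ValueError("need at least two matching samples")
    energy = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy, max(power_w)


# Hypothetical samples: four readings taken 0.5 s apart.
t = [0.0, 0.5, 1.0, 1.5]
p = [0.2, 0.4, 0.4, 0.2]
energy, peak = energy_and_peak(t, p)
print(f"energy = {energy:.3f} J, peak = {peak:.3f} W")
```

A faster design that draws more power can still consume less energy, because the integration interval shrinks with the benchmark runtime.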
Note:
QuestaSim is a proprietary tool that you probably cannot get anyway. Testing it on a real board is probably your only option.
Reporting Results
Your report should have a clear flow from the beginning to the end. Sections should nicely follow each other logically with connecting links from one to the other.
In the introduction you should provide a short summary of the entire work while covering the following items:
- Motivation for your approach: what is the general idea behind your optimization proposal, how have you decided to make the changes to the core, and why those and no other changes;
- Changes that you have performed to the cpu, together with their implications
- Obtained main results
- Your main conclusions
This section should not include details about your design. At the end of the introduction add the organization (outline) of the report.
Next, include a separate section in which you motivate your choices for improvement in a way which is much more descriptive and comprehensive than what you mentioned in the introduction.
To this end you need to detail:
- The way you analyzed the baseline processor in order to identify its weak points
- Your findings based on which you have decided to make the improvements
- The improvement avenues to be pursued
- What you expect from implementing your changes
The following section should provide a description of all the performed changes. Here you need to motivate, in detail, your design choices. For example: Let us presume that you chose to improve the multiplier. Before you go into details regarding the architecture and the implementation of the improved version, you should include a small survey of the different types of multipliers you could choose, and present the reason(s) for your choice. Do not forget to refer to the literature. While completing this section give attention to the following:
- Include figures to present the architecture of the new blocks that you introduce in the design. It is advisable to start from the top level and go down to each important component (i.e., a component that you have implemented in a special manner), in such a way that it is clear where it is placed and why.
- Describe the way you embed your designs into the processor. You can draw the action of handshaking signals in the form of a timing chart and/or state machine. Bear in mind that handling the interface signals properly is essential in order to allow for a smooth integration of your new modules within the processor.
- Discuss the design verification aspects you considered.
- If relevant, attach Vivado simulation results.
- Summarize the settings of the final processor configuration.
The next section is dedicated to reporting and commenting on the experimental results. In this section you should:
- Provide experimental results and analyze them separately for each modification you did, as well as for combinations of improvements. In this way, you can compare the improvements in terms of cost and benefit against the baseline, against each other, and see how the results add up when you combine different improvements.
- Report the detailed results (including timing information, critical path information, resource utilization information from Vivado reports, and power consumption figures) for all considered designs.
- Take into account in your analysis basic relevant performance metrics, e.g., area (A), critical path delay (deduced from clock frequency) (D), Benchmarks Scores (BSs), energy consumption (E), as well as compound metrics, e.g., A*D, A*BS, E, and E*BS products.
- Comment on the obtained results and try to identify which improvement is contributing to which figure of merit and which proposed improvement is the most effective.
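The compound metrics mentioned above are plain products of the basic ones, so a short script can tabulate them for every design variant. The sketch below is only an illustration: the function names and all numbers are made up, the delay is taken as the clock period, and whether a lower or higher benchmark score is better depends on how the score is defined in results.txt.

```python
def delay_ns_from_mhz(clock_mhz):
    """Critical path delay (clock period) in ns for a clock in MHz."""
    return 1000.0 / clock_mhz


def compound_metrics(area, clock_mhz, score, energy_j):
    """Return the basic metrics plus the area-delay, area-score and
    energy-score products for one design."""
    delay = delay_ns_from_mhz(clock_mhz)
    return {
        "A": area,
        "D": delay,
        "BS": score,
        "E": energy_j,
        "A*D": area * delay,
        "A*BS": area * score,
        "E*BS": energy_j * score,
    }


def relative_to(design, baseline):
    """Ratio of every metric of `design` to the same metric of `baseline`."""
    return {key: design[key] / baseline[key] for key in design}


# Hypothetical numbers, for illustration only.
baseline = compound_metrics(area=4000, clock_mhz=50.0, score=10.0, energy_j=2.0)
improved = compound_metrics(area=5000, clock_mhz=62.5, score=8.0, energy_j=1.5)
for name, ratio in relative_to(improved, baseline).items():
    print(f"{name:5s} {ratio:.2f}x baseline")
```

Reporting ratios against the baseline makes it easier to see which modification pays for its area, and which one only moves cost from one metric to another.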
Finally, summarize your work and add your conclusions and possible future work plans in the last report section. In this context you should also put things into perspective and include the following:
- What was your initial plan;
- What were your expectations;
- What are the results and why are the results not according to the expectations in case they are not.
If you have any feedback for us, the course, you can also write this in your conclusion. This is not mandatory, though is greatly appreciated. This feedback does not count towards your grade in any way, though if you are worried that it has your name on it because the report does, we recommend you to fill in our evasys questionnaire which is 100% anynomous.
Some other issues you need to think about when writing your report:
- The report should be consistent, in structure, language, and formatting style.
- It is a common practice to make use of the present tense, with the exception of the conclusion section, where past tense should be employed.
- Headings are written with capitals and numbered.
- The text is usually justified.
- References should be placed in a dedicated section at the end of the report.
See also: Submission Information
The CPU and surrounding components
The platform, whose block design is illustrated below, is a minimal System-on-a-Chip (SoC) design written in VHDL, which consists of the following components:
- a 32-bit MIPS processor core - cpu (component 1),
- a boot memory - bram (component 6) and its controller (component 3),
- a unified 2-KB cache - included in cpu,
- a DDR memory and its controller - included in the Zynq PS (component 7), and
- a Universal Asynchronous Receiver/Transmitter (UART) unit - uart (component 8).
Component 10 consists of sensors and an analog-to-digital converter that are used for the energy evaluation.
Components 2, 4, 5, 9 are either interconnects or interface protocol converters.
The Zynq PS - component 7 - generates the clock signal for all the platform components,
and the reset signal for the components 9 to 11.
Then component 12 generates the reset signal for the rest of the platform.
The cpu - component 1 - is based on the open-source Plasma cpu and interfaces available on the OpenCores and GitHub websites.
The original Plasma design was updated for our purpose and tailored for the Pynq-Z1 board.
All components marked with RTL are VHDL modules with source code given,
while all the other components are Intellectual Property (IP) cores provided by Xilinx.
Using the mlite CPU emulator
An emulator of the mlite CPU is provided to debug the benchmarks or derive runtime statistics such as instruction usage. To run the emulator for a specific benchmark, one change needs to be made in the common Makefile. In '/benchmarks/Makefile.commmon', change line 88 from '../../tools/mlite.exe (IMG_FILE) B'. Now, in the directory of a specific benchmark such as '/benchmarks/pi', run the following:
make clean image
make test_sim
If you want to change the behaviour of the emulator, you can do so by altering and re-compiling 'emulator/mlite.c'.
The Communication Protocol
All platform components communicate among themselves using the Advanced eXtensible Interface (AXI) protocol, which is a standard ARM communication protocol, and the de facto standard adopted by Xilinx for the connection of IPs and functional blocks in SoC designs. There are 3 types of AXI interfaces employed in the SoC platform:
- AXI3 (component 7 - S_AXI_HP0 port)
- AXI4LITE (component 8 - axi port)
- AXI4 for the rest of the platform components
The figure below outlines the AXI4 protocol, architecture and a timing diagram corresponding to a write and a read transaction. Subsequently, we detail only the write transaction handshaking, as the read transaction follows suit. Whenever an AXI Master wants to perform a write transaction, it will send first a set of initial information about the transaction (e.g., burst type, burst size, cacheable attributes of transaction) to the AXI Slave.
At the same time, the AXI Master will send to the AXI Slave the address where the data should be written (AWADDR) and will signal that the driven address is valid (AWVALID). When the AXI Slave is able to accept the address, it will signal back to the AXI Master (AWREADY). The address transfer from Master to Slave happens when both AWVALID and AWREADY are asserted (clock cycle T3 in the Figure). Similarly to the write address transfer, the write data (WDATA) will be sent from the AXI Master to the AXI Slave (clock cycles T5, T7, T9, and T10 in the Figure). The signal WLAST will mark the last data transfer in the burst. The slave will then respond to the Master if the write transaction was successful or not (BRESP), and assert BVALID when it drives a valid write response (clock cycle T11 in the Figure).
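The VALID/READY handshake described above can be illustrated with a small cycle-based model. This is a simplified sketch of a single channel, not a model of the platform's AXI controllers: the function name and the READY pattern are invented for the example, and the sender simply keeps VALID asserted until each item has been accepted.

```python
def simulate_channel(beats, ready_pattern):
    """Send `beats` over one VALID/READY channel.

    `ready_pattern` gives the receiver's READY value for each clock cycle.
    The sender keeps VALID high while it still has data, so a transfer
    happens in every cycle where READY is also high.
    Returns a list of (cycle, item, is_last) tuples, where is_last plays
    the role of WLAST on the write data channel.
    """
    pending = list(beats)
    transfers = []
    for cycle, ready in enumerate(ready_pattern):
        if not pending:
            break
        valid = True  # data is waiting, so VALID is asserted
        if valid and ready:
            item = pending.pop(0)
            transfers.append((cycle, item, not pending))
    return transfers


# Hypothetical four-beat write burst with a receiver that stalls sometimes.
for cycle, data, last in simulate_channel(
        [0xA0, 0xA1, 0xA2, 0xA3], [0, 1, 0, 1, 1, 0, 1]):
    print(f"cycle {cycle}: data=0x{data:02X} last={last}")
```

When adding a new unit behind an AXI interface, tracing the handshake this way is a quick method to predict how many cycles a burst takes for a given stall pattern.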
The MLite CPU
The MLite CPU is a small synthesizable 32-bit RISC microprocessor that executes all MIPS I [11] user mode instructions except unaligned load and store operations. In the VHDL code the CPU top module, which includes the cpu (mlite_cpu unit), and the AXI4 read and write controllers, is represented by the cpu unit.
Program Counter
The pc_next unit generates the address of the next instruction on its pc_future output port.
The value of pc_future can either:
- be the incremented value of the previous program counter (stored locally in pc_reg)
- come directly from the mem_ctrl unit (on the opcode25_0 input port), in case of an unconditional branch
- be computed by the alu unit and received on the pc_new input port, in case of a conditional branch
The selection is performed by the pc_source signal, generated by the control unit.
In case of a conditional branch, the take_branch signal is also utilized in the selection:
pc_future takes the pc_new value when take_branch is 1,
and the previously incremented program counter value when take_branch is 0.
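The selection logic can be summarised in a short behavioural model. The sketch below is written from the description above and from the standard MIPS jump encoding, not from the VHDL source: the `PcSource` names and values are hypothetical, and the way the jump target is assembled from opcode25_0 is an assumption about this core.

```python
from enum import Enum

MASK32 = 0xFFFFFFFF


class PcSource(Enum):
    """Hypothetical encoding of the pc_source control signal."""
    INCREMENT = 0  # sequential execution
    JUMP = 1       # unconditional branch, target taken from opcode25_0
    BRANCH = 2     # conditional branch, target taken from pc_new


def pc_future(pc_reg, pc_source, opcode25_0=0, pc_new=0, take_branch=False):
    """Behavioural model of the next-PC selection."""
    pc_inc = (pc_reg + 4) & MASK32  # MIPS instructions are 4 bytes wide
    if pc_source is PcSource.JUMP:
        # Assumed MIPS J-type target: upper 4 bits of the incremented PC,
        # followed by the 26-bit field shifted left by two.
        return (pc_inc & 0xF0000000) | ((opcode25_0 & 0x03FFFFFF) << 2)
    if pc_source is PcSource.BRANCH and take_branch:
        return pc_new & MASK32
    return pc_inc


print(hex(pc_future(0x100, PcSource.INCREMENT)))
print(hex(pc_future(0x100, PcSource.BRANCH, pc_new=0x200, take_branch=True)))
print(hex(pc_future(0x100, PcSource.BRANCH, pc_new=0x200, take_branch=False)))
```

A model like this is convenient as a reference when changing the branch logic, for example when experimenting with branch prediction.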
Memory Interface
The mem_ctrl unit manages the cpu to memory communication:
it sends addresses to memory, and both receives data from and sends data to memory.
The unit is controlled through the mem_source signal,
issued by the control unit,
to perform one of the following tasks:
- Instruction fetch:
  - it sends the appropriate instruction address, received from pc_next on the address_pc port, to the memory;
  - it receives the instruction opcode on its data_r port;
  - it delivers the instruction opcode to the control unit on its opcode_out port.
- Data memory read (for load operations):
  - it sends to the memory the address of the data to be loaded;
  - it receives the data from memory and passes it to the bus_mux unit.
- Data memory write (for store operations):
  - it sends to the memory the data and the address where it is to be written.
Decode and Control
The control unit performs instruction decode, based on which it generates the control signals for all the other units.
The actual logic behind what control signals are sent, depends on the instruction set:
Bus
The main task of the bus_mux unit is to multiplex the input signals of the functional units.
In addition, the bus_mux unit also performs the comparison required by the conditional branch instructions,
and it generates the branch taken/not taken signal on its take_branch port.
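As an illustration of the comparison step, the sketch below evaluates the conditions of the MIPS I conditional branch instructions on 32-bit register values. It is a behavioural model only: the table keyed by mnemonic is a convenience for the example, whereas the hardware selects the comparison through control signals whose encoding is not covered here.

```python
def to_signed32(value):
    """Interpret a 32-bit pattern as a two's complement signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


# Branch conditions of the MIPS I conditional branches.
# rs and rt are the values read from the register file.
BRANCH_TESTS = {
    "beq":  lambda rs, rt: rs == rt,
    "bne":  lambda rs, rt: rs != rt,
    "blez": lambda rs, rt: to_signed32(rs) <= 0,
    "bgtz": lambda rs, rt: to_signed32(rs) > 0,
    "bltz": lambda rs, rt: to_signed32(rs) < 0,
    "bgez": lambda rs, rt: to_signed32(rs) >= 0,
}


def take_branch(mnemonic, rs_value, rt_value=0):
    """Return True when the branch named by `mnemonic` is taken."""
    test = BRANCH_TESTS[mnemonic]
    return test(rs_value & 0xFFFFFFFF, rt_value & 0xFFFFFFFF)


print(take_branch("beq", 5, 5))
print(take_branch("bltz", 0xFFFFFFFF))  # bit pattern of -1
print(take_branch("bgtz", 0))
```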
Register File
The MLite CPU is based on the MIPS I instruction set, hence it embeds 32 32-bit general purpose registers. From the user (compiler) perspective, each register has a specific function, detailed below. You can use the table to find the mapping between the software register name (the one present in the benchmark assembly listing) and the hardware address of the register in the register bank.
The only functions that also need support from the hardware implementation are the following:
- The value of register R0 is always zero
- R31 is used as the link register to return from a subroutine
In addition to the general purpose registers there are 4 special registers, also detailed below.
- HI and LO registers contain the 32-bit MSB and LSB part, respectively, of a 64-bit multiplication/division result.
- The Program Counter (PC) specifies the address of the next instruction in the program.
- The Exception Program Counter (EPC) register remembers the program counter when there is an interrupt or exception.
- There is no status register. Instead, the results of a comparison set a register value, and the branch then tests this register value.
Register | Name | Function |
---|---|---|
R0 | zero | Always contains 0 |
R1 | at | Assembler temporary |
R2-R3 | v0-v1 | Function return value |
R4-R7 | a0-a3 | Function parameters |
R8-R15 | t0-t7 | Function temporary values |
R16-R23 | s0-s7 | Saved registers across function calls |
R24-R25 | t8-t9 | Function temporary values |
R26-R27 | k0-k1 | Reserved for interrupt handler |
R28 | gp | Global pointer |
R29 | sp | Stack Pointer |
R30 | s8 | Saved register across function calls |
R31 | ra | Return address from function call |
HI-LO | hi-lo | Multiplication/division results |
PC | Program Counter | Points at 8 bytes past current instruction |
EPC | Exception PC | Exception program counter return address |
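When reading a benchmark assembly listing next to a waveform, it is handy to have the general purpose part of the table above as a lookup. The following sketch builds that mapping in both directions; the variable and function names are illustrative only.

```python
def build_register_names():
    """Map register numbers 0..31 to their software names (table above)."""
    names = {0: "zero", 1: "at", 28: "gp", 29: "sp", 30: "s8", 31: "ra"}
    # (first register number, name prefix, first name index, count)
    groups = [
        (2, "v", 0, 2),   # R2-R3   v0-v1
        (4, "a", 0, 4),   # R4-R7   a0-a3
        (8, "t", 0, 8),   # R8-R15  t0-t7
        (16, "s", 0, 8),  # R16-R23 s0-s7
        (24, "t", 8, 2),  # R24-R25 t8-t9
        (26, "k", 0, 2),  # R26-R27 k0-k1
    ]
    for first_reg, prefix, first_index, count in groups:
        for i in range(count):
            names[first_reg + i] = f"{prefix}{first_index + i}"
    return names


REGISTER_NAMES = build_register_names()
REGISTER_NUMBERS = {name: num for num, name in REGISTER_NAMES.items()}

print(REGISTER_NAMES[29])      # name of R29
print(REGISTER_NUMBERS["t9"])  # hardware address of t9
```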
The interconnection of the reg_bank unit with the rest of the units is pictured below. It is implemented using two dual port memories.
Functional Units
The MLite CPU has three functional units:
- an ALU
- a Multiplier/Divider
- a Shifter
The ALU (alu.vhd) executes arithmetic and logic operations, with a delay of 1 clock cycle.
The adder in the ALU is described in behavioral VHDL as a ripple-carry adder.
The serial Multiplier/Divider (mult.vhd) takes 32 cycles to compute the 64-bit multiplication result,
or the 32-bit quotient and the 32-bit remainder.
The pipeline is stalled during the mul/div operation by asserting the pause_out signal.
The Shifter (shifter.vhd) performs left and right bit-shifting in 1 clock cycle.
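To see where the 32 cycles of a serial multiplier come from, consider the generic shift-and-add scheme sketched below, which handles one multiplier bit per iteration. This is a textbook algorithm for unsigned operands, given only as an illustration; it is not a transcription of mult.vhd, and signed multiplication and division are left out.

```python
MASK32 = 0xFFFFFFFF


def serial_multiply(a, b):
    """Unsigned 32x32-bit shift-and-add multiplication.

    One bit of the multiplier `b` is examined per iteration, which models
    one clock cycle of a serial multiplier.
    Returns (hi, lo, cycles): the upper and lower 32 bits of the 64-bit
    product, as they would end up in the HI and LO registers.
    """
    a &= MASK32
    b &= MASK32
    product = 0
    cycles = 0
    for bit in range(32):
        if (b >> bit) & 1:
            product += a << bit  # add the shifted multiplicand
        cycles += 1
    return (product >> 32) & MASK32, product & MASK32, cycles


hi, lo, cycles = serial_multiply(0xFFFFFFFF, 0xFFFFFFFF)
print(f"HI=0x{hi:08X} LO=0x{lo:08X} after {cycles} cycles")
```

Because the pipeline is stalled for the whole operation, processing more than one multiplier bit per cycle is one possible way to shorten multiplication-heavy benchmarks, at the cost of additional area.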
Pipeline
The pipeline unit contains the pipeline registers (flip-flops) that delay the inputs of the functional units and of the register file write port. Other separation registers between pipeline stages are placed in their corresponding modules.
Frequently Asked Questions
Can somebody take a look at my design? I have some errors that I do not manage to solve.
No, we cannot debug your VHDL programs for you. There are weekly labs in which you can ask questions about technical issues you're facing. However, we generally expect you to debug your own code. We can recommend using version control software to keep track of old versions of your code such that you can go back to it.
Can I have a discussion with you about the design? I have some ideas to improve it.
Yes. There are weekly labs, and there is an intermediate milestone meeting scheduled on May 8th and 9th in which those things can be discussed. Before and after that, you may email your questions to j.b.doenszelmann@tudelft.nl and/or S.D.Cotofana@tudelft.nl.
One-time path updating for Vitis the first time you launch the tool.
Right click on design_2_wrapper in the Explorer pane, select Build Project. Then right click on design_2_wrapper in the Explorer pane, select Update Hardware Specification, browse and select the .xsa file that you just exported from Vivado (folder fpga/zynq_fpga/workspace).
ERROR in Vitis: Cannot find -lxil.
Right click on appARMcpu in the Explorer pane, then select Properties, C/C++ General, Paths and Symbols, Library Paths, and add the path of libxil.a.
Simulation gets stuck at the Execute Simulation step.
Add the pdp folder to the scanning exception list of the antivirus software installed on your own computer.