Lab 2: RISC-V
Welcome to Lab 2, a medium-sized Verilog project focused on FPGA development using Vivado. In this lab, you will be working with a custom RISC-V CPU core and implementing a Scaled Index (SI) load instruction to extend its functionality.
DOWNLOAD THE TEMPLATE : PicoRV32_template. Inside you will find a Baseline project, which you will use to implement the new instruction.
IMPORTANT VIVADO NOTE: Whenever you make any changes in the Verilog file, you need to open the block diagram and click Refresh the Changed Modules OR Update IP. If you do not, the generated bitstream will be built using the OLD Verilog code. This is a Vivado tool requirement and we cannot do anything about it.
Table of Contents
- Lab Evaluation Method
- Part 1: Understanding RISC-V and PicoRV32 core
- Part 2: Adding an instruction
- Part 3: Testing and Profiling the new instruction
- BONUS: Answer the following questions
Learning Goals of the Lab
- Introduction to RISC-V Architecture
- Understanding how instructions are executed on a processor
- Implementing a custom instruction and understanding its impact
- Introduction to Vitis Classic and understanding how memory works in Vivado
Lab Evaluation Method
Each Part of this Lab has questions that you need to answer as a GROUP. Once you have finished a Part, answer its questions and call a TA to get them verified (if the TAs are busy and you are confident in your answers, you can move on to the next Part, but get them verified before the end of the lab session).
We suggest you carry pen and paper (or note-taking tablet) with you to the lab to answer these questions.
Part 1: Understanding RISC-V and PicoRV32 core
You may do this part of the assignment before the lab session.
The link to download the template is provided at the beginning of this page!
In this section, you will understand how RISC-V instructions are structured and how the PicoRV32 executes an instruction. Answer the following questions and once completed, call a TA for sign-off.
- Decode the following instruction and write it in assembly format using RISC-V Instruction Card and RISC-V Register Map
32'b00000000000100111110111010010011
- Convert the assembly code into binary machine code, with clear demarcation for the various sections of the instruction using RISC-V Instruction Card and RISC-V Register Map:
add t0, a2, s3
- Understanding the `load_word` instruction implementation in the PicoRV32 processor:
  (a). What two local registers are used to uniquely identify the load word instruction? What fields of the instruction are used for this? (You can refer to PicoRV32 Instruction Decoder)
  (b). Draw a block diagram showing all the CPU state transitions involved in a load word instruction, where the first and last CPU state is `fetch`. (You can refer to PicoRV32 CPU States)
  (c). What does `cpu_state_ldmem` do? List all the instructions that use the `cpu_state_ldmem` state as part of their execution (refer to `picorv32.v`, line no. 1856).
Part 2: Adding an instruction
In this Part, you will be adding a Scaled Indexed Load instruction, formally known as Load Word Indexed (LWI) (not "immediate" if you are familiar with MIPS assembly). To add this custom instruction to the RISC-V core, you will have to modify the Verilog code of the processor. The processor should be able to decode the new instruction and execute the expected operation when the instruction is called.
Understanding the Template
The link to download the template is provided at the beginning of this page!
The template provided has additional components other than the PicoRV32 core. You should read through the Project Structure section to get a better understanding of how the project works and how the PicoRV32 core executes code.
Running the Baseline
Groups who do not have Vivado installed on their laptops can use the desktops in the Lab. To use the desktops, follow the steps explained in Using the Desktop.
Before making changes in the code, you should try simulating the baseline project. Please follow these steps:
- Read and understand the baseline code shown below:

      _start:
          # Load base address of 0x4000_0050 into t0
          lui  t0, 0x40000      # Load upper 20 bits of address into t0
          addi t0, t0, 0x50     # Add lower 12 bits of address to t0 to get 0x4000_0050

          # Load the value at address 0x4000_0050 into t1
          lw   t1, 0(t0)        # Load the 32-bit value from address 0x4000_0050 into t1

          # Load the value at address 0x4000_0054 into t2
          lw   t2, 4(t0)        # Load the 32-bit value from address 0x4000_0054 into t2

          # Perform addition
          add  t3, t1, t2       # Add t1 and t2, store result in t3

          # Infinite loop with nop instructions
      loop:
          nop                   # No operation
          j    loop             # Jump to the start of the loop
- Open `testbench.v`, verify that it reads the baseline memory as shown below (lines 57-58 of the file), and save the file. This will ensure that the simulator loads the unmodified memory module.

      // Baseline ISA
      $readmemh("memory_data.mem", mem.memory_array);
- Follow `Flow navigator → SIMULATION → Run Simulation` and wait until you see the wave outputs.
- By looking at the OPCODE signal, you can verify that instructions are being decoded one after another. There is no need to manually decode the instructions.
- Note down the values stored in t2 and t3 (right-click on a signal to change its radix) and compare them with the Expected Simulation Output.
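The way the baseline program builds its base address can be checked on paper or with a few lines of C. The sketch below models only the arithmetic of `lui` and `addi`; the function names are illustrative and are not part of the lab template.

```c
#include <assert.h>
#include <stdint.h>

/* lui: place a 20-bit immediate in bits 31..12 of the result. */
uint32_t lui(uint32_t imm20) {
    assert(imm20 < (1u << 20));               /* immediate must fit in 20 bits */
    return imm20 << 12;
}

/* addi: add a sign-extended 12-bit immediate to a register value. */
uint32_t addi(uint32_t rs1, int32_t imm12) {
    assert(imm12 >= -2048 && imm12 <= 2047);  /* 12-bit signed range */
    return rs1 + (uint32_t)imm12;
}
```

For example, `addi(lui(0x40000), 0x50)` yields `0x40000050`, the address used by the first `lw`; the second `lw` adds its offset of 4 to reach `0x40000054`.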
LWI Instruction
The Indexed Load instruction, identified by the opcode `0101011`, is used for indexed memory addressing. It adds the values in `rs1` and `rs2` to calculate the effective memory address, so it uses two registers where the `lw` instruction uses one register and an immediate offset.
The `lwi` instruction, short for "Load Word Indexed", follows the RISC-V R-type layout and is encoded as follows:
31 ... 25 | 24 ... 20 | 19 ... 15 | 14 ... 12 | 11 ... 7 | 6 ... 0 |
---|---|---|---|---|---|
funct7 | rs2 | rs1 | funct3 | rd | opcode |
0000000 | rs2 | rs1 | 010 | rd | 0101011 |
- funct7 (7 bits): A constant field containing `0000000`.
- rs2 (5 bits): Specifies the index register.
- rs1 (5 bits): Specifies the base register.
- funct3 (3 bits): A constant field containing `010`.
- rd (5 bits): Specifies the destination register.
- opcode (7 bits): The opcode for this custom instruction, containing `0101011`.
Additional Information about the RISC-V Instruction structure is provided in RISC-V Card
LWI Assembly format:
`lwi <rd>, <rs2>(<rs1>)`
Here the `<rd>`, `<rs1>`, and `<rs2>` placeholders denote the fields which specify the destination, base, and index registers, respectively.
LWI Assembly example:
0: 0x16A232B lwi t1, s6(s4)
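The encoding table and the example above can be cross-checked with a small C sketch. It packs the fields exactly as listed in the table and tests the three constant fields; the function and macro names are illustrative assumptions, and the register numbers follow the standard RISC-V register map (t1 = x6, s4 = x20, s6 = x22).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LWI_OPCODE 0x2Bu   /* 0101011 */
#define LWI_FUNCT3 0x2u    /* 010 */
#define LWI_FUNCT7 0x00u   /* 0000000 */

/* Pack the lwi fields; register arguments are x-register numbers 0..31. */
uint32_t encode_lwi(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return (LWI_FUNCT7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (LWI_FUNCT3 << 12) | (rd << 7) | LWI_OPCODE;
}

/* True when opcode, funct3 and funct7 all match the lwi encoding. */
bool is_lwi(uint32_t insn) {
    return (insn & 0x7Fu) == LWI_OPCODE &&
           ((insn >> 12) & 0x7u) == LWI_FUNCT3 &&
           ((insn >> 25) & 0x7Fu) == LWI_FUNCT7;
}
```

Calling `encode_lwi(6, 20, 22)` reproduces the word `0x016A232B` from the example, while an ordinary `lw t1, 0(t0)` (`0x0002A303`) is rejected by `is_lwi` because its opcode differs. The Verilog decoder has to make the same three-field comparison.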
Operation
The operation for this instruction can be represented as the following C code:
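    #include <stdint.h>

    /* rs1 = base address, rs2 = byte index, rd = destination register */
    void lwi(uint32_t rs1, uint32_t rs2, uint32_t *rd) {
        *rd = *(uint32_t *)(uintptr_t)(rs1 + rs2);
    }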
- Note: The index is byte-scaled, not word-scaled. This means that the value in `rs2` is treated as a byte offset rather than a word offset. To access the n-th 32-bit element in an int array, for example, you would need to set `rs2` to `n * 4`, not simply `n`.
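The byte-scaling rule can be illustrated with a software model. This is a minimal sketch, assuming a byte-addressed buffer stands in for the data memory; `lwi_index_for_element` and `lwi_model` are hypothetical helper names, not part of the template.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offset to place in rs2 to reach the n-th 32-bit element. */
uint32_t lwi_index_for_element(uint32_t n) {
    return n * (uint32_t)sizeof(uint32_t);   /* n * 4 */
}

/* Software model of lwi: base and index are byte offsets into mem. */
uint32_t lwi_model(const uint8_t *mem, size_t mem_size,
                   uint32_t base, uint32_t index) {
    uint32_t addr = base + index;            /* effective address = rs1 + rs2 */
    assert((size_t)addr + 4 <= mem_size);    /* stay inside the modelled memory */
    uint32_t value;
    memcpy(&value, mem + addr, sizeof value);
    return value;
}
```

With an array of words copied into `mem`, element 2 is read with an index of `lwi_index_for_element(2)`, which is 8. Passing 2 directly would read from the middle of element 0.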
Assignment Steps:
Make modifications in the `picorv32.v` file to add the custom instruction. These changes are divided into 3 steps (refer to your answers for Part 1: Q3):
- Modify the instruction decoder of the processor to decode and recognize the instruction.
- Modify the CPU state transitions to enter the correct mode when the new instruction is detected.
- Modify the `cpu_state_ldmem` state or create a custom-defined state to execute the operation for the custom instruction.
Hint: If you are having issues with the above steps, start by understanding how the `lw` (Load Word) instruction is decoded and implemented.
Running the Modified Code
After implementing the LWI instruction in the Verilog code, it is time to simulate and verify its functionality. Follow the steps below:
- Read and understand the code which uses LWI in action:

      _start:
          # Load base address of 0x4000_0000 into t0
          lui  t0, 0x40000      # Load upper 20 bits of address into t0

          # Load the value at address 0x4000_0050 into t2
          addi t1, x0, 0x50     # Set the byte index to 0x50
          lwi  t2, t1(t0)       # Load the 32-bit value from address 0x4000_0050 into t2

          # Load the value at address 0x4000_0054 into t3
          addi t1, t1, 4        # Advance the byte index to 0x54
          lwi  t3, t1(t0)       # Load the 32-bit value from address 0x4000_0054 into t3

          # Perform addition
          add  t3, t3, t2       # Add t3 and t2, store result in t3

          # Infinite loop with nop instructions
      loop:
          nop                   # No operation
          j    loop             # Jump to the start of the loop
- Open `testbench.v`, verify that it reads the modified memory as shown below (lines 57-61 of the file), and save the file. This will ensure that the simulator loads the modified memory module. Do not forget to comment out the baseline memory module.

      // Baseline ISA
      // $readmemh("memory_data.mem", mem.memory_array);

      // Modified ISA
      $readmemh("mod_memory.mem", mem.memory_array);
- Follow `Flow navigator → SIMULATION → Run Simulation` and wait until you see the wave outputs.
- By looking at the OPCODE signal, verify that you can see the LWI opcode bits.
- Note down the value stored in t3. Compare it with the Expected Simulation Output in the Simulation section.
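Both test programs read the same two words and add them, so they should leave the same value in t3. The C model below is a sketch of that equivalence under stated assumptions: the memory is a plain word array starting at `0x40000000`, `MEM_WORDS` is an arbitrary size, and the stored values are made up for illustration (the real contents come from the `.mem` files).

```c
#include <assert.h>
#include <stdint.h>

#define MEM_BASE  0x40000000u   /* base address used by both programs */
#define MEM_WORDS 64u           /* hypothetical size of the modelled memory */

/* Read one aligned 32-bit word from the modelled memory. */
uint32_t load_word(const uint32_t *mem, uint32_t addr) {
    assert(addr >= MEM_BASE && (addr & 3u) == 0);
    uint32_t idx = (addr - MEM_BASE) / 4;
    assert(idx < MEM_WORDS);
    return mem[idx];
}

uint32_t run_baseline(const uint32_t *mem) {
    uint32_t t0 = (0x40000u << 12) + 0x50;   /* lui + addi */
    uint32_t t1 = load_word(mem, t0 + 0);    /* lw t1, 0(t0) */
    uint32_t t2 = load_word(mem, t0 + 4);    /* lw t2, 4(t0) */
    return t1 + t2;                          /* add t3, t1, t2 */
}

uint32_t run_modified(const uint32_t *mem) {
    uint32_t t0 = 0x40000u << 12;            /* lui t0, 0x40000 */
    uint32_t t1 = 0x50;                      /* addi t1, x0, 0x50 */
    uint32_t t2 = load_word(mem, t0 + t1);   /* lwi t2, t1(t0) */
    t1 += 4;                                 /* addi t1, t1, 4 */
    uint32_t t3 = load_word(mem, t0 + t1);   /* lwi t3, t1(t0) */
    return t3 + t2;                          /* add t3, t3, t2 */
}
```

If the two functions disagree for the same memory contents, the effective-address calculation in the `lwi` model is wrong. The same check applies to the waveform: t3 should match between the baseline and modified simulations.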
Only once the simulation output matches should you generate the bitstream and program the FPGA. Refer to Synthesizing & Programming and the following instructions to program the FPGA.
Groups using the Lab desktops should follow the instructions in Flashing FPGA from the Desktop to program the FPGA.
Groups facing issues with Vitis Classic should follow the steps in Running remotely to run their program.
Note: Do not forget to note down the number of cycles before you pass the FPGA on to the next group.
Part 3: Testing and Profiling the new instruction
Once you have the instruction implemented, it is important to understand how the new instruction can be used in a real-world application.
Answer the following questions based on your observations from the simulation and serial log:
- How many cycles are required for the `lw`, `add`, and `lwi` instructions in simulation?
- Refer to the img_blur code for the Baseline implementation and the Modified implementation, and calculate the theoretical difference in cycle count between the two programs.
  - Hint: Given that our image is 32x32 pixels, the blur loop will run 1,024 times.
- What is the actual difference in cycle count that you noted down at the end of Part 2 when the Baseline and Modified programs are run on the FPGA?
- When you compare your theoretical difference in cycle count with the actual difference, you can see that they do not match. This is because simulation is only an abstraction of the actual system. In the simulation the memory is connected directly to the processor, but on the FPGA it goes through many intermediate components, which you can see in the Block Diagram.
  Assume that on the FPGA, reading each instruction from the instruction memory takes four more cycles, and each load of a value from the memory takes one more cycle.
  Recalculate your theoretical cycle counts now. Do they match?
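One way to organise the recalculation is sketched below in C. It assumes the penalties are additive: four extra cycles per executed instruction and one extra cycle per data load, on top of the cycle count observed in simulation. The function names are illustrative, and the instruction and load counts must come from your own reading of the img_blur code.

```c
#include <assert.h>
#include <stdint.h>

/* Extra cycles on the FPGA relative to simulation, as stated in the lab text. */
#define FETCH_PENALTY 4u   /* per executed instruction */
#define LOAD_PENALTY  1u   /* per data load */

/* Cycles for a loop body repeated a fixed number of times. */
uint64_t loop_cycles(uint64_t iterations, uint64_t cycles_per_iteration) {
    return iterations * cycles_per_iteration;
}

/* Adjust a simulated cycle count for the FPGA memory path. */
uint64_t fpga_cycles(uint64_t sim_cycles, uint64_t instructions, uint64_t loads) {
    assert(loads <= instructions);   /* every load is itself an instruction */
    return sim_cycles + FETCH_PENALTY * instructions + LOAD_PENALTY * loads;
}
```

For the 32x32 image, `loop_cycles(32 * 32, c)` gives the loop total for a per-iteration cost `c`. Applying `fpga_cycles` to the baseline and modified totals separately and subtracting gives the adjusted difference to compare with the measured one.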
BONUS: Answer the following questions
- Why is image blurring a good use of the `lwi` instruction? Suggest 2 more algorithms that can benefit from this instruction.
- Suggest a new custom instruction that can also optimize the image blurring program. Justify your answer.
-
How will the ratio of cycle count in baseline vs modified change, if the number of pixels is increased? Justify your answer.
- Area Utilization (to answer this question, refer to the Area Utilization Report):
  (a). Does the added instruction significantly increase slice utilization? (For more information: FPGA Slices)
  (b). How might this affect scalability or the addition of further instructions in the future?