Lab 2: RISC-V

Welcome to Lab 2, a medium-sized Verilog project focused on FPGA development using Vivado. In this lab, you will be working with a custom RISC-V CPU core and implementing a Scaled Index (SI) load instruction to extend its functionality.

DOWNLOAD THE TEMPLATE : PicoRV32_template. Inside you will find a Baseline project, which you will use to implement the new instruction.

IMPORTANT VIVADO NOTE: Whenever you change the Verilog file, you must open the block diagram and run Refresh Changed Modules or Update IP. If you skip this step, the generated bitstream will be built from the OLD Verilog code. This is a requirement of the Vivado tool, and we cannot change it.

Table of Contents

Lab Evaluation Method
Part 1: Understanding RISC-V and PicoRV32 core
Part 2: Adding an instruction
Part 3: Testing and Profiling the new instruction
BONUS: Answer the following Questions

Learning Goals of the Lab

Introduction to RISC-V Architecture
Understanding how instructions are executed on a processor
Implementing a custom instruction and understanding the impact of this
Introduction to Vitis Classic and understanding how memory works in Vivado

Lab Evaluation Method

Each Part of this Lab has questions that you need to answer as a GROUP. Once you have finished a Part, answer its questions and call a TA to verify them. If the TAs are busy and you are confident in your answers, you can move on to the next Part, but get your answers verified before the end of the lab session.

We suggest you carry pen and paper (or note-taking tablet) with you to the lab to answer these questions.

Part 1: Understanding RISC-V and PicoRV32 core

You may do this part of the assignment before the lab session.

The link to download the template is provided at the beginning of this page!

In this section, you will understand how RISC-V instructions are structured and how the PicoRV32 executes an instruction. Answer the following questions and once completed, call a TA for sign-off.

Decode the following instruction and write it in assembly format using the RISC-V Instruction Card and RISC-V Register Map:

32'b00000000000100111110111010010011

Convert the assembly code into binary machine code, with clear demarcation for the various sections of the instruction using RISC-V Instruction Card and RISC-V Register Map:

add t0, a2, s3

Understanding the load_word instruction implementation in the PicoRV32 Processor

(a). What two local registers are used to uniquely identify the load word instruction? What fields of instruction are used for this? (You can refer to PicoRV32 Instruction Decoder)

(b). Draw a block diagram showing all the CPU state transitions involved in a load word instruction where the first and last cpu-state is fetch. (You can refer to PicoRV32 CPU States)

(c). What does the cpu_state_ldmem state do? List all the instructions that use cpu_state_ldmem as part of their execution (refer to picorv32.v, line 1856).

Part 2: Adding an instruction

In this Part, you will add a Scaled Indexed Load instruction, formally known as Load Word Indexed (LWI). The "I" stands for "Indexed", not "Immediate" as you might expect if you are familiar with MIPS assembly. To add this custom instruction to the RISC-V core, you will have to modify the Verilog code of the processor. The processor should be able to decode the new instruction and execute the expected operation when the instruction is called.

Understanding the Template

The link to download the template is provided at the beginning of this page!

The template provided has additional components other than the PicoRV32 core. You should read through the Project Structure section to get a better understanding of how the project works and how the PicoRV32 core executes code.

Running the Baseline

Groups who do not have Vivado installed on their laptops can use the desktops in the Lab. To use the desktops, follow the steps explained in Using the Desktop.

Before making changes in the code, you should try simulating the baseline project. Please follow these steps:

Read and understand the baseline code shown below:

_start:
  # Load base address of 0x4000_0050 into t0
  lui  t0, 0x40000     # Load upper 20 bits of address into t0
  addi t0, t0, 0x50    # Add lower 12 bits of address to t0 to get 0x4000_0050

  # Load the value at address 0x4000_0050 into t1
  lw   t1, 0(t0)       # Load the 32-bit value from address 0x4000_0050 into t1

  # Load the value at address 0x4000_0054 into t2
  lw   t2, 4(t0)       # Load the 32-bit value from address 0x4000_0054 into t2

  # Perform addition
  add  t3, t1, t2      # Add t1 and t2, store result in t3

  # Infinite loop with nop instructions
  loop:
      nop              # No operation
      j loop           # Jump to the start of the loop

Open testbench.v and verify that it reads the baseline memory as shown below (around lines 57-58 of the file), then save the file. This ensures that the simulator loads the unmodified memory module.
```
    // Baseline ISA
    $readmemh("memory_data.mem", mem.memory_array);
```
Follow Flow navigator → SIMULATION → Run Simulation and wait until you see the wave outputs.
By looking at the OPCODE signal, you can verify that instructions are being decoded one after another. There is no need to manually decode the instructions.
Note down the values stored in t2 and t3 (right-click a signal to change its radix) and compare them with the Expected Simulation Output.
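
The address arithmetic in the baseline program above can be checked by hand before simulating. The following is a minimal sketch in C, not part of the template: the function name lui_addi is hypothetical, and it only models how lui followed by addi builds a 32-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: models the register value after
 *   lui  t0, upper20
 *   addi t0, t0, imm12
 * lui places its 20-bit immediate in bits 31..12 and clears bits 11..0;
 * addi then adds a sign-extended 12-bit immediate. */
uint32_t lui_addi(uint32_t upper20, int32_t imm12)
{
    assert(upper20 <= 0xFFFFFu);              /* lui immediate is 20 bits */
    assert(imm12 >= -2048 && imm12 <= 2047);  /* addi immediate is signed 12 bits */

    uint32_t value = upper20 << 12;           /* lui  t0, upper20 */
    return value + (uint32_t)imm12;           /* addi t0, t0, imm12 (wraps modulo 2^32) */
}
```

Calling lui_addi(0x40000, 0x50) reproduces the base address 0x4000_0050 used by the two lw instructions, and the second lw adds its own offset of 4 on top of that.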

LWI Instruction

The Indexed Load instruction, identified by the opcode 0101011, supports register-indexed memory addressing. It adds rs1 and rs2 to compute the effective memory address. In other words, it uses two registers, whereas the lw instruction uses one register and an immediate offset.

The lwi instruction, short for "Load Word Indexed" has a specific encoding in the RISC-V instruction set. The instruction is encoded as follows:

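| 31 ... 25 | 24 ... 20 | 19 ... 15 | 14 ... 12 | 11 ... 7 | 6 ... 0 |
|-----------|-----------|-----------|-----------|----------|---------|
| funct7    | rs2       | rs1       | funct3    | rd       | opcode  |
| 0000000   | rs2       | rs1       | 010       | rd       | 0101011 |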

funct7 (7 bits): A constant field containing '0000000'.
rs2 (5 bits): Specifies the index register.
rs1 (5 bits): Specifies the base register.
funct3 (3 bits): A constant field containing '010'.
rd (5 bits): Specifies the destination register.
opcode (7 bits): The opcode for this custom instruction, containing '0101011'.

Additional Information about the RISC-V Instruction structure is provided in RISC-V Card

LWI Assembly format:

'lwi <rd>,<rs2>(<rs1>)'

Here the <rd>, <rs1>, and <rs2> placeholders denote the fields which specify the destination, base, and index registers, respectively.

LWI Assembly example:

0:	0x16A232B          	lwi t1, s6(s4)
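
To see how the fields above combine into the machine word in this example, here is a minimal sketch in C. It is not part of the template: the function name encode_lwi and the macro names are hypothetical, and the register numbers come from the standard RISC-V register map (t1 = x6, s4 = x20, s6 = x22).

```c
#include <assert.h>
#include <stdint.h>

/* Constant fields of the lwi encoding, taken from the table above. */
#define LWI_OPCODE 0x2Bu  /* 0101011 */
#define LWI_FUNCT3 0x2u   /* 010 */
#define LWI_FUNCT7 0x00u  /* 0000000 */

/* Hypothetical helper: packs register numbers (0..31) into an lwi word. */
uint32_t encode_lwi(uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    assert(rd < 32 && rs1 < 32 && rs2 < 32);  /* each register field is 5 bits */

    return (LWI_FUNCT7 << 25) |  /* bits 31..25 */
           (rs2        << 20) |  /* bits 24..20: index register */
           (rs1        << 15) |  /* bits 19..15: base register */
           (LWI_FUNCT3 << 12) |  /* bits 14..12 */
           (rd         << 7)  |  /* bits 11..7: destination register */
           LWI_OPCODE;           /* bits 6..0 */
}
```

For lwi t1, s6(s4), the call encode_lwi(6, 20, 22) yields 0x016A232B, which matches the example above.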

Operation

The operation for this instruction can be represented as the following C code:

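#include <stdint.h>

/* rs1 = base address, rs2 = byte index, rd = destination register */
void lwi(uint32_t rs1, uint32_t rs2, uint32_t *rd) {
    *rd = *(volatile uint32_t *)(uintptr_t)(rs1 + rs2);
}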

Note: The index is not word scaled: it is byte-scaled. This means that the value in rs2 is treated as a byte offset rather than a word offset. To access the n-th 32-bit element in an int array, for example, you would need to set rs2 to n * 4, not simply n.
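
The byte-scaling rule can be illustrated with a small software model. The sketch below is written under stated assumptions and is not the PicoRV32 implementation: memory is modeled as a plain byte array, words are little-endian as in RISC-V, and the names load_word, lwi_model, and word_index_to_rs2 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit word from a byte-addressed memory model. */
uint32_t load_word(const uint8_t *mem, size_t mem_size, uint32_t addr)
{
    assert(mem_size >= 4 && addr <= mem_size - 4);  /* stay inside the model */
    return (uint32_t)mem[addr]
         | ((uint32_t)mem[addr + 1] << 8)
         | ((uint32_t)mem[addr + 2] << 16)
         | ((uint32_t)mem[addr + 3] << 24);
}

/* lwi semantics: the effective address is rs1 + rs2, both counted in bytes. */
uint32_t lwi_model(const uint8_t *mem, size_t mem_size, uint32_t rs1, uint32_t rs2)
{
    return load_word(mem, mem_size, rs1 + rs2);
}

/* Value to place in rs2 to reach the n-th 32-bit element of an array. */
uint32_t word_index_to_rs2(uint32_t n)
{
    return n * 4u;
}
```

With a 16-byte array holding four words, lwi_model(mem, 16, 0, word_index_to_rs2(2)) returns the third word, because the index register must hold 8 rather than 2.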

Assignment Steps:

Modify the picorv32.v file to add the custom instruction. The changes are divided into 3 steps (refer to your answers for Part 1, Q3):

Modify the instruction decoder of the processor to decode and recognize the instruction.
Modify the CPU state transitions to enter the correct mode when the new instruction is detected.
Modify the cpu_state_ldmem or create a custom-defined state to execute the operation for the custom instruction.

Hint: If you are having issues with the above steps, start by understanding how the lw (Load Word) instruction is decoded and implemented.
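
As a reference for the first step, the check that a decoder has to perform can be written out in software. This is only a sketch in C of the field comparison implied by the lwi encoding table; it is not the Verilog you need to write, and the names rtype_fields, split_rtype, and is_lwi are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fields of a register-register style instruction word. */
typedef struct {
    uint32_t opcode, rd, funct3, rs1, rs2, funct7;
} rtype_fields;

/* Split a 32-bit instruction word into its fields. */
rtype_fields split_rtype(uint32_t insn)
{
    rtype_fields f;
    f.opcode = insn & 0x7Fu;          /* bits 6..0 */
    f.rd     = (insn >> 7) & 0x1Fu;   /* bits 11..7 */
    f.funct3 = (insn >> 12) & 0x7u;   /* bits 14..12 */
    f.rs1    = (insn >> 15) & 0x1Fu;  /* bits 19..15 */
    f.rs2    = (insn >> 20) & 0x1Fu;  /* bits 24..20 */
    f.funct7 = (insn >> 25) & 0x7Fu;  /* bits 31..25 */
    return f;
}

/* True when the constant fields match the lwi encoding. */
bool is_lwi(uint32_t insn)
{
    rtype_fields f = split_rtype(insn);
    return f.opcode == 0x2Bu && f.funct3 == 0x2u && f.funct7 == 0x00u;
}
```

For the example word 0x016A232B, split_rtype gives rd = 6, rs1 = 20, and rs2 = 22, and is_lwi returns true. A regular lw word is rejected because its opcode differs.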

Running the Modified Code

After implementing the lwi instruction in the Verilog code, it is time to simulate and verify its functionality. Follow the steps below:

Read and understand the code, which shows lwi in action:

_start:
  # Load base address 0x4000_0000 into t0
  lui  t0, 0x40000     # Load upper 20 bits of the address into t0

  # Load the value at address 0x4000_0050 into t2
  addi t1, x0, 0x50    # Set the byte index in t1 to 0x50
  lwi  t2, t1(t0)      # Load the 32-bit value from address 0x4000_0050 into t2

  # Load the value at address 0x4000_0054 into t3
  addi t1, t1, 4       # Advance the byte index to 0x54
  lwi  t3, t1(t0)      # Load the 32-bit value from address 0x4000_0054 into t3

  # Perform addition
  add  t3, t3, t2      # Add t3 and t2, store result in t3

  # Infinite loop with nop instructions
  loop:
      nop              # No operation
      j loop           # Jump to the start of the loop

Open testbench.v and verify that it reads the modified memory as shown below (around lines 57-61 of the file), then save the file. This ensures that the simulator loads the modified memory module. Do not forget to comment out the baseline $readmemh line.
```
    // Baseline ISA
    // $readmemh("memory_data.mem", mem.memory_array);

    // Modified ISA
    $readmemh("mod_memory.mem", mem.memory_array);
```
Follow Flow navigator → SIMULATION → Run Simulation and wait until you see the wave outputs.
By looking at the OPCODE signal, verify that you can see the LWI opcode bits.
Note down the value stored in t3. Compare with the Expected Simulation Output in the Simulation section.

Only then, Generate the Bitstream and program the FPGA. Refer to Synthesizing & Programming and the following instructions to program the FPGA.

Groups using the Lab desktops should follow the instructions in Flashing FPGA from the Desktop to program the FPGA.

Groups facing issues with Vitis Classic should follow the steps in Running remotely to run their program.

Note: Do not forget to note down the number of cycles before you pass the FPGA on to the next group.

Part 3: Testing and Profiling the new instruction

Once you have the instruction implemented, it is important to understand how the new instruction can be used in a real-world application.

Answer the following questions based on your observations from the simulation and serial log:

How many cycles are required for the lw, add, and lwi instructions in simulation?
Refer to the img_blur code for Baseline implementation and Modified implementation and calculate the theoretical difference in cycle count between the two programs.
- Hint: Given that our image is 32x32 pixels the blur loop will run 1,024 times
What is the actual difference in cycle count you noted down at the end of Part 2 when the Baseline and Modified programs are run on the FPGA?
When you compare your theoretical difference in cycle count with the actual difference, you can see that they do not match. This is because simulation is only an abstraction of the actual system. In the simulation the memory is connected directly to the processor, but on the FPGA memory accesses pass through several intermediate components, as the Block Diagram shows.

Assume that on the FPGA, fetching each instruction from the instruction memory takes four more cycles than in simulation, and each load of a value from memory takes one more cycle.

Recalculate your theoretical cycle counts now. Do they match?
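
The recalculation follows a simple pattern that can be sketched in C. The code below is only an illustration of the arithmetic described above; the function and macro names are hypothetical, and the per-iteration cycle, instruction, and load counts are inputs that you must take from your own simulation and from the img_blur code.

```c
#include <stdint.h>

#define EXTRA_FETCH_CYCLES 4u  /* extra cycles per instruction fetch on the FPGA */
#define EXTRA_LOAD_CYCLES  1u  /* extra cycles per data load on the FPGA */

/* Adjust a simulated cycle count for the FPGA memory path.
 * sim_cycles:   cycles measured in simulation for a code section
 * instructions: number of instructions executed in that section
 * loads:        number of those instructions that load from memory */
uint64_t fpga_cycles(uint64_t sim_cycles, uint64_t instructions, uint64_t loads)
{
    return sim_cycles
         + EXTRA_FETCH_CYCLES * instructions
         + EXTRA_LOAD_CYCLES * loads;
}

/* Total cycle difference between two loop bodies over all iterations. */
int64_t loop_cycle_difference(uint64_t baseline_per_iter,
                              uint64_t modified_per_iter,
                              uint64_t iterations)
{
    return ((int64_t)baseline_per_iter - (int64_t)modified_per_iter)
         * (int64_t)iterations;
}
```

Apply fpga_cycles to one iteration of the baseline loop and one iteration of the modified loop, then pass both results to loop_cycle_difference with 1,024 iterations for the 32x32 image.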

BONUS: Answer the following questions

Why is image blurring a good use of the lwi instruction? Suggest two more algorithms that can benefit from this instruction.
Suggest a new custom instruction that can also optimize the image blurring program. Justify your answer.
How will the ratio of cycle count in baseline vs modified change, if the number of pixels is increased? Justify your answer.
Area Utilization (to answer this question, refer to the Area Utilization Report):
(a). Does the added instruction significantly increase slice utilization? (for more information: FPGA Slices)

(b). How might this affect scalability or the addition of further instructions in the future?

Computer Engineering