Lecture 1: What is Rust?
Software Fundamentals is part of the Computer and Embedded Systems Engineering degree programme. Programming computer systems is an important part of embedded systems engineering. C (together with C++) has been the standard for programming such systems for many years, to the point that many of you are likely familiar with it. C has many advantages:
- C has had tooling built for it for close to 50 years
- Many people know C
- It has become a standard for foreign function interfaces (communication between programming languages)
- It has a well-defined specification (ANSI C) with a well-defined ABI that can (often) be relied on
- It is very fast compared to almost all other languages
However, over the years, many people have sought to improve C since the language also has several shortcomings:
- Few compile-time checks (leading to bugs)
- Relatively weak type system (you can very easily convert pointers of one type to another)
- Non-trivial safety problems like buffer overflows (no memory safety)
- No namespacing, all functions are (essentially) global
- Many non-standard compiler-vendor extensions that are widely relied upon
- Poor Unicode support
Languages trying to improve on C have existed for a long time. Some examples you may know:
- C++ (Adds classes, lambda functions, generic templating, compile-time code execution)
- Java (Adds classes, a garbage collector, portable binaries, though not always as useful for embedded)
In this course, we're going to be using, and teaching you, a language called Rust. Rust is a relatively new language (it has only existed since 2009), and yet it has been voted the most loved (now called 'admired') programming language for the past 8 years. Because the language is so new, it takes advantage of many lessons learned in programming-language development over the past 40 years, lessons that C hasn't benefited from because of its age.
The language was originally created by Graydon Hoare at Mozilla, but is currently developed by the Rust community. The original pitch by Hoare can still be found here, although it looks almost nothing like modern Rust. Rust is open source; you can find (and contribute to) its source here. Every 6 weeks there is a new release with bug fixes and, possibly, new language features.
In this course, we will teach you to work with Rust, talk about what sets it apart, and explain why we chose it over other systems languages like C and C++.
In the rest of these notes, we will compare Rust in a few ways to C, C++ and Java. C and C++ are obviously relevant since they are also systems programming languages; Java may seem less so. Still, it provides a useful contrast as a fairly high-level language with syntax similar to the other languages discussed. Additionally, Java is the main language taught in the Computer Science department.
Memory safety
In this part, we will refer to the "heap" and the "stack". In lecture two, we will talk more about what those are. You can already read about it here.
Memory safety is the property of a programming language aimed to prevent memory errors. When a language is not memory safe, bugs like the following can happen:
- Out-of-bounds reads and writes (of arrays, for example), and reads and writes to uninitialized memory. In a language like C, if you make an array of 5 elements, C won't prevent you from reading the 11th element even though it doesn't exist. The value that is read may be another variable, or uninitialized (and effectively random). If a value is written out of bounds, it may overwrite another variable, or crash the program.
- Race conditions. Memory that is shared between different threads of execution is concurrently read from and written to, leading to unexpected behavior.
- Use-after-free. Some memory is allocated, then used, then deallocated. But if a pointer to the allocated memory still exists, the memory can still be read from or written to, causing unexpected behavior.
- Memory leaking. Allocations are made but never freed, leading to a constant increase in memory usage, which may starve the system of resources.
- However, Rust does consider memory leaks to be safe, as they cannot trigger undefined behaviour. Of course, Rust still tries to minimize memory leaks as much as possible.
These issues may lead to unexpected crashes of programs. However, if one of these bugs occurs and the program does not crash, the result may be even more alarming.
- Out-of-bounds reads may read data that is supposed to be private.
- Race conditions may cause writes to be missed, or could even lead to deadlocks.
- If memory is still used after deallocation, but another, unrelated, allocation is made at the same location, secret information may leak, or vital information could be corrupted.
- Leaking memory may starve other programs, possibly vital to the operation of the system, of resources.
A memory-safe language aims to make it impossible to be in one of those situations. This can be achieved through careful compile-time checking or through extra work at runtime. This article is a very interesting read if you want to learn more about the different possible techniques used to check for memory safety at compile time.
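As a small taste of the runtime approach, here is a sketch in Rust (the language we will get to below): safe Rust checks every array index at runtime, so an out-of-bounds access becomes a controlled panic instead of a silent read of unrelated memory. The way the index is obtained here is just an illustrative trick to keep the compiler from spotting the problem ahead of time.

fn main() {
    let arr = [1, 2, 3, 4, 5];

    // Pretend this index comes from user input, so the compiler
    // cannot know its value ahead of time.
    let index = std::env::args().count() + 9;

    // Safe Rust checks every index at runtime: this line panics with
    // "index out of bounds" instead of reading memory past the array.
    println!("{}", arr[index]);
}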
Memory safety in C, C++ and Java
Let's look at an example of a language that is not memory-safe: C.
#include <stdio.h>

char *read() {
    char buf[20];
    gets(buf);
    return buf;
}

int main() {
    char *r1 = read();
    char *r2 = read();
    printf("%s -> %s\n", r1, r2);
}

void secret() {
    printf("secret!");
}
This program simply reads two strings from the user, and then prints them. What could go wrong? Well, what if the user types 21 characters? There's only room for 20. What would we overwrite? Right next to buf, there is probably the return pointer: the address to jump to when the read function is done (back to main). If we type 20 characters and then the address of the secret function, we'd return to secret instead of main and print "secret!".
But let's say we typed fewer than 21 characters. Let's say we input hello and world.
$ ./program
hello
world
> world -> world
We would probably see "world -> world". Why? buf is a local variable in read. When we return from read, the variable doesn't exist anymore, so r1 doesn't really point to a proper string any more. But it's very likely that the data is still stored there anyway. And if we call read again? Then the same memory location is used for the second read, and returned again, so both results end up pointing at the same (overwritten) data. So this program would either crash, or show "world -> world" (the second string twice).
In C++, something similar will happen.
However, in a higher-level language like Java, which is memory safe, we would not experience these kinds of issues.
First, an array like char[20] would be allocated on the heap. Both times we run the function, a new allocation is made, and thus a new string is created.
Next, strings are not of constant size. If you input more than 20 characters, the allocation will simply grow and the extra characters will be stored in the extra space.
Strings also store their own length. Every time the data is accessed (when it is printed, for example), a check is done to see whether the access is within bounds (which is possible because the length is known).
Instead of losing the allocation when read returns, all allocations are sort of "permanent": they last until they are not used anymore. Returning a string is therefore no issue.
import java.io.*;

public class Main {
    static String read() {
        Console console = System.console();
        return console.readLine();
    }

    public static void main(String[] args) {
        String r1 = read();
        String r2 = read();
        System.out.println(r1 + "->" + r2);
    }

    // cannot be accidentally or maliciously called
    static void secret() {
        System.out.println("secret!");
    }
}
So how do we check when data is not used anymore? A common method is called reference counting.
The garbage collector and reference counting
Part of Java's solution to this memory unsafety is reference counting. It works as follows: every time we make an object (like a string), we also store a so-called "reference count", which keeps track of how many "users" of this object there are. The only way we can use a string is if the program knows where it is stored (the object is referenced). So whenever we create another reference to an object, the language automatically adds code which increments this reference counter. And when a reference to an object stops being used, the reference counter is decreased. At some point, the reference count may reach zero. At that moment, we know that no variable in the program knows the location of the string object any more. That also means it can't ever be used again, and thus it may safely be deallocated (deleted).
By "stopping to use a reference", we do not mean that objects are simply deleted after a while. Instead, when an object is used in a function and the function returns (without returning that object), there is no way to access the object anymore. At that point, we can be sure it is safe to decrease the reference count by one.
String exampleFunction() {
    String example = "This is an example string";
    // at this point it becomes impossible to access `example` anymore,
    // as the variable goes out of scope.
    // Therefore, we can be sure the reference count can be decreased.
    return "this is the string that is returned";
}
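Rust itself also offers reference counting as an opt-in library type, std::rc::Rc. As a rough sketch of the counting in action (Rc::strong_count lets us peek at the counter):

use std::rc::Rc;

fn main() {
    // One heap-allocated string, with a reference count stored next to it.
    let first = Rc::new(String::from("shared string"));
    assert_eq!(Rc::strong_count(&first), 1);

    {
        // Cloning an Rc does not copy the string;
        // it only increments the reference count.
        let second = Rc::clone(&first);
        assert_eq!(Rc::strong_count(&second), 2);
    } // `second` goes out of scope: the count drops back to 1.

    assert_eq!(Rc::strong_count(&first), 1);
} // the count reaches 0 here and the string is deallocated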
There is one problem with this strategy. Let's look at the following C program, and let's pretend that C actually uses reference counting.
#include <stdlib.h>

// Create a struct example, which points to another example struct
typedef struct example {
    struct example *other;
} example_t;

void create_circular() {
    // e1 referenced once
    example_t *e1 = (example_t *)malloc(sizeof(example_t));
    // e2 referenced once
    example_t *e2 = (example_t *)malloc(sizeof(example_t));
    // e1 references e2, so e2 is referenced twice
    e1->other = e2;
    // e2 references e1, so e1 is referenced twice
    e2->other = e1;
}

int main() {
    create_circular();
    // Both local variables, e1 and e2, are gone after calling the function.
    // But e1 still references e2, and e2 still references e1. Their reference
    // counts are still both 1, so they won't be freed by a reference-counting scheme.
}
In actual C, you wouldn't have reference counting, and this code would simply contain a memory leak. However, even if reference counting were used (as Java would do), this code would still contain a memory leak! To solve that, we need a garbage collector. A garbage collector is, in essence, a program that periodically searches for unused circular references and removes them. How exactly this works isn't that important here, but if you want to know more you can search for it.
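For completeness, the same cycle can be built in Rust: Rc on its own cannot reclaim cyclic references either. The sketch below uses RefCell for mutation, which we won't explain here; the point is only that the counts never reach zero, so the memory leaks (which, as noted above, Rust considers safe but undesirable).

use std::cell::RefCell;
use std::rc::Rc;

// A node that may point to another node, like the C struct above.
struct Example {
    other: RefCell<Option<Rc<Example>>>,
}

fn create_circular() {
    let e1 = Rc::new(Example { other: RefCell::new(None) });
    let e2 = Rc::new(Example { other: RefCell::new(None) });

    // e1 references e2, and e2 references e1, so both counts become 2.
    *e1.other.borrow_mut() = Some(Rc::clone(&e2));
    *e2.other.borrow_mut() = Some(Rc::clone(&e1));

    assert_eq!(Rc::strong_count(&e1), 2);
    assert_eq!(Rc::strong_count(&e2), 2);
} // e1 and e2 drop here: the counts fall to 1, never to 0, and the memory leaks.

fn main() {
    create_circular();
}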
What is important is that all of this takes time. Reference counting takes time (and some space, for the reference count itself). Periodically sweeping memory to search for circular references takes time. Allocating all objects on the heap takes time. Checking the bounds of arrays takes time. Thus, in Java and other such languages, we pay for memory safety with slower program execution. This may be well worth it, but especially on embedded systems, which are resource-constrained, this extra slowdown can be unacceptable.
Although we will see that Rust doesn't need a garbage collector, people have made garbage collectors for Rust. What's neat is that it's possible to implement one in Rust itself, without needing to change the language itself. This is a link to an interesting article about such a garbage collector.
Static memory safety
Rust takes a new approach. Instead of handling memory safety at runtime, it tries to detect memory-unsafe programs at compile time, partly through the type system. This means that if you make a mistake which could cause a memory safety bug, your program simply will not compile, and in Rust's case a helpful error message is shown explaining why the compiler can't allow what you tried.
Let's translate our previous example to Rust and see what it has to say:
// let's assume this exists in the standard library
fn gets(buf: &str) { todo!() }

fn read() -> &str {
    // a string of 20 space characters (non-idiomatic)
    let buf = "                    ";
    // not actually a function in Rust, but let's assume it exists
    gets(&buf);
    return &buf;
}

fn main() {
    let r1 = read();
    let r2 = read();
    println!("{} {}", r1, r2);
}
Unicode (A Slight Detour)
In C, a character is the size of a byte. That would mean that there can only be a total of 256 characters. At first, this was enough, but as soon as computers started communicating across international borders, problems arose.
- Some languages have accents on letters, like French, German, Scandinavian languages, and more
- Many languages don't even use the Latin alphabet.
- Some languages have more than 256 "letters" (Unicode calls them "graphemes" since not all languages use letters, like the symbols in many Asian languages)
- Many countries have different currency symbols
- People want to use both Emojis and Flags
The Unicode consortium has created a standard in which all symbols you would ever want to use in writing are assigned a number, called a Unicode codepoint. The most common encoding before Unicode was ASCII. All ASCII symbols kept their old codepoints, but many more were added: almost 100,000 in total in Unicode version 4.0.
To store characters with numbers up to 100,000 (and more are supported), we need more than a single byte per character. In Rust, the char datatype is in fact a 32-bit integer. But if every character were a 32-bit integer, the text "hello" would be 20 bytes long. Why spend 4 bytes on every character? Many Unicode codepoints are rarely used, and not all codepoints are even assigned, so some are never used at all.
To encode strings of Unicode data, different encodings were invented. The most common is UTF-8. This is a variable-length encoding: when a character is in the old ASCII range, it takes only 1 byte. Only when a character is used that does not exist in ASCII are more bytes used to encode it: 2, 3 or 4, depending on the character.
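A small sketch of what these sizes look like from Rust (the byte counts below are properties of UTF-8 itself, not of Rust):

fn main() {
    // A Rust char is a 32-bit value, big enough for any Unicode codepoint.
    assert_eq!(std::mem::size_of::<char>(), 4);

    // In a UTF-8 encoded &str, ASCII characters take 1 byte each...
    assert_eq!("hello".len(), 5);
    // ...while other characters take 2, 3 or 4 bytes.
    assert_eq!("é".len(), 2);
    assert_eq!("€".len(), 3);
    assert_eq!("🦀".len(), 4);

    // Counting characters is therefore different from counting bytes.
    assert_eq!("héllo".chars().count(), 5);
    assert_eq!("héllo".len(), 6);
}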
A Rust &str is always UTF-8 encoded and is thus not equivalent to a char * (byte array). You can create byte strings with the special byte string literal b"test" (with a b in front).
The &str type describes a reference to a string. So let's change the program a bit.
fn read() -> &str {
    let buf = "                    ";
    gets(&buf);
    return buf;
}
But the Rust compiler will loudly complain about this program: gets didn't get a mutable reference, so it can't modify the string to put the read data in. And that's a good thing. Rust wouldn't allow us to make this string mutable even if we wanted to, since we don't know the length of the input in advance.
A gets function would want to grow and shrink the string based on how many characters the user types. The only way to do this is to allocate the string on the heap; that way it can grow (this is exactly what Java did). So let's change the program again, this time using the String type, which is a heap-allocated string.
(Do note that none of the programs above compile so far. In each example something was wrong that could be a safety bug, and the compiler stopped each from compiling, meaning the bug could never occur in a running program.)
// let's assume this exists in the standard library
fn gets(buf: &mut String) { unimplemented!() }

fn read() -> &str {
    let mut buf = String::new();
    gets(&mut buf);
    return &buf;
}

fn main() {
    let r1 = read();
    let r2 = read();
    println!("{} {}", r1, r2);
}
But this still won't compile. We return a reference to a string (&str) from this function, but buf is a local variable. This is exactly the same problem we had in our C program, where after the return buf wouldn't contain the data we wanted anymore.
The solution? Don't return a reference, but return ownership of the heap-allocated string. This stops it from being deallocated when the function returns; instead, the string is deallocated when main returns (and the program ends). That looks like this:
fn read() -> String {
    let mut buf = String::new();
    gets(&mut buf);
    return buf;
}
This is somewhat equivalent to the Java version, where the string is allocated on the heap and can therefore live longer than the function it is created in. However, Rust does not need reference counting, since it only allows a single place to have ownership of an object at a time. If there can only be one owner, there is nothing to count.
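A minimal sketch of what this ownership transfer looks like (String::from allocates on the heap, just like String::new followed by writing into it):

fn make_string() -> String {
    // `s` owns a heap allocation, just like `buf` in `read`.
    let s = String::from("hello");
    // Ownership is moved to the caller: no copy, no reference count.
    s
}

fn main() {
    // `owned` is now the single owner of the allocation.
    let owned = make_string();
    println!("{}", owned);
} // `owned` goes out of scope here, and the allocation is freed exactly once.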
Abstractions for systems programming (only C and C++)
The Rust compiler is conservative. That means it checks your program and guarantees that it won't be memory-unsafe at runtime (see memory safety). But sometimes a program that is in fact memory-safe is rejected, because the compiler isn't aware of all the invariants of the exact system you are programming for. For example, some categories of race conditions cannot happen on single-core machines. This poses a problem, since sometimes you need to do something that is safe, but that the compiler can't guarantee is safe.
Let's look at a relatively simple example. Rust's safety rules say that you may only have a single mutable reference to an allocation at a time. This makes it impossible to update a value from two different places at the same time, which could otherwise cause a race condition.
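For instance, a sketch as small as the following is rejected by the compiler, because two mutable references to value would exist at the same time:

fn main() {
    let mut value = 0;
    let first = &mut value;
    let second = &mut value; // error: cannot borrow `value` as mutable more than once at a time
    *first += 1;
    *second += 1;
}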
An array can be considered a single allocation. But when you split the array into two halves, it's actually perfectly safe to mutate the two halves at the same time since the halves don't overlap at all.
So it is safe to write a function like this:
// splits `inp` at index `at` and returns mutable
// references to the left and right halves
fn split<T>(inp: &mut [T], at: usize) -> (&mut [T], &mut [T]) {
    unimplemented!()
}
But Rust cannot check that it is correct (what if the halves do overlap? That depends entirely on your implementation).
To make this possible, you can introduce small sections of so-called "unsafe" code in your program: code that you claim is sound, but that the compiler cannot check. For split, this would be:
use std::slice::from_raw_parts_mut;

pub unsafe fn split_at_mut_unchecked<T>(inp: &mut [T], mid: usize) -> (&mut [T], &mut [T]) {
    let len = inp.len();
    let ptr = inp.as_mut_ptr();

    // SAFETY: Caller has to check that `0 <= mid <= inp.len()`.
    //
    // `[ptr; mid]` and `[mid; len]` are not overlapping, so returning mutable references
    // to them is fine.
    unsafe {
        // from_raw_parts_mut creates a new &mut [T] from a pointer and a length
        (
            from_raw_parts_mut(ptr, mid),
            from_raw_parts_mut(ptr.add(mid), len - mid),
        )
    }
}

fn split<T>(inp: &mut [T], mid: usize) -> (&mut [T], &mut [T]) {
    assert!(mid <= inp.len());
    // SAFETY: `[ptr; mid]` and `[mid; len]` are inside `inp`, which
    // fulfills the requirements of `from_raw_parts_mut`.
    unsafe { split_at_mut_unchecked(inp, mid) }
}
Notice the unsafe blocks. Code inside those can be memory-unsafe if the programmer wants it to be. The reason this is better than allowing unsafe code everywhere, like in C, is that you can audit these small sections much more easily than the entire codebase of a program. If a memory bug occurs, it must have occurred in one of these sections, and if there are relatively few of them, that makes your life a lot easier.
Notice too that the split function is not marked as "unsafe". split uses unsafe code internally, but that code has been thoroughly checked, and as long as you perform the unsafe operations through the split function, your program is safe. We call the split function a safe abstraction. Safe abstractions form small enclaves of unsafe code that perform common operations. As long as you don't write the unsafe code for those operations yourself, and instead use the safe abstraction, your program is safe!
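The standard library in fact ships this safe abstraction as the slice method split_at_mut; a short usage sketch:

fn main() {
    let mut data = [1, 2, 3, 4, 5];

    // The standard library exposes the same safe abstraction as `split_at_mut`.
    let (left, right) = data.split_at_mut(2);

    // Both halves can be mutated independently, with no unsafe code here.
    left[0] = 10;
    right[0] = 30;

    assert_eq!(data, [10, 2, 30, 4, 5]);
}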
So why Rust?
Rust promises both fast and safe code and is suitable for programming embedded systems. That makes it a strong alternative to languages like C and C++. Should we do away with C and C++? No, of course not. Many systems already run on C code and work fine. But adding some more safety to systems without much runtime cost is something worth doing in the future.
Learning Rust
The Rust language comes with a book, written by Steve Klabnik, Carol Nichols, and the Rust community. Together with these lecture notes, the book may already teach you enough about the language. However, in the live lectures and the labs there will be opportunities to ask questions, which we encourage you to do. We can then have discussions about them, so everyone can learn from your questions.