Lecture 5: Collections and Iterators
Collections
The Rust standard library has a module called `collections`, which contains a number of commonly used data structures. We have already seen one standard collection, `Vec`. In this lecture, we will also look at `HashMap` and `HashSet`, two data structures you might find useful.
If you prefer listening to or watching a more concise lecture on this topic, we highly recommend the video by Jon Gjengset.
HashMap
A `HashMap` is a data structure that works a bit like a database. Elements in a `HashMap` always have some kind of key, together with an associated value. For example, in a database of students, the key might be the student ID and the value the information associated with that student ID.
Let's imagine we are designing our own HashMap of keys and values. We might imagine a vector of tuples:

    struct Student {
        name: &'static str,
        average_grade: f64,
    }

    fn main() {
        let mut v = Vec::new();
        v.push((1, Student { name: "Sarah", average_grade: 7.6 }));
        v.push((12, Student { name: "Tim", average_grade: 8.2 }));
        v.push((34, Student { name: "Josh", average_grade: 5.9 }));
        v.push((42, Student { name: "Cat", average_grade: 9.1 }));
    }

The type of each pair is `(i32, Student)` here.
Let's say that we would like to find the student with ID 42.
To do that, we can simply iterate over all students until we find Cat.

    // continuing inside `main` from the example above;
    // printing with {:?} requires #[derive(Debug)] on Student

    // find a value for a specific key
    println!("{:?}", v.iter().find(|(k, _)| *k == 42));
    // prints: Some((42, Student { name: "Cat", .. }))

In this case the list of students is not too long, so finding the student with ID 42 won't take too long. But the longer the list becomes, the slower it is to find specific students. Unless we are lucky of course. If we had wanted to find the student with ID 1 in this example, it would be very quick since Sarah happens to be the first student in our list!
We say that an operation like this one, whose running time grows in proportion to the number of items, is O(n). That notation essentially says that the time it takes grows like a linear function of the number of items n.
So can we do better? What if we knew that all the keys are numbers between 0 and 99? We could create an array of 100 elements like this:

    #[derive(Copy, Clone, Debug)]
    struct Student {
        name: &'static str,
        average_grade: f64,
    }

    fn main() {
        // every slot starts out empty
        let mut v: [Option<Student>; 100] = [None; 100];
        v[1] = Some(Student { name: "Sarah", average_grade: 7.6 });
        v[12] = Some(Student { name: "Tim", average_grade: 8.2 });
        v[34] = Some(Student { name: "Josh", average_grade: 5.9 });
        v[42] = Some(Student { name: "Cat", average_grade: 9.1 });

        // find the value of a key:
        println!("{:?}", v[14]); // None
        println!("{:?}", v[42]); // Some(Student { name: "Cat", .. })
    }

On the last line here, we attempted to find a student with ID 42 and instantly found Cat, since their information was stored at index 42.
To find that, we did not actually need to search through the array, so the operation takes what we call constant time, or O(1). The speed of this lookup does not depend at all on how many students we store, as long as we know the ID of the student we are looking for.
Notice here, that we can only have a single student with an ID. There's only a single place for every ID, so when we add a second student with an ID that was already in use, the second student would overwrite the first.
The instant-lookup trick with arrays only works when our keys are integers. If we want to find people by their name, we cannot make an array that is indexed by strings. Furthermore, if we don't know a maximum value for these integers, we might need to allocate enormous amounts of memory, so that is infeasible as well.
That is where a `HashMap` comes in. It's like an array that can do these instant lookups by key, but unlike the array example it works for almost any key imaginable. Let's look at the example below. Here we use the students' names as keys, not their student numbers.

    use std::collections::HashMap;

    #[derive(Copy, Clone, Debug)]
    struct Student {
        student_id: u64,
        average_grade: f64,
    }

    fn main() {
        let mut v = HashMap::new();
        v.insert("Sarah", Student { student_id: 1, average_grade: 7.6 });
        v.insert("Tim", Student { student_id: 12, average_grade: 8.2 });
        v.insert("Josh", Student { student_id: 34, average_grade: 5.9 });
        v.insert("Cat", Student { student_id: 42, average_grade: 9.1 });

        println!("{:?}", v.get("Cat")); // Some(Student { student_id: 42, .. })
    }

The code looks pretty similar to the array example, but now uses `insert` and `get` instead of indexing. With a `HashMap`, finding an item takes, on average, the same amount of time regardless of the size of the map.
Do note that `HashMap`s do not have a defined order. This is a consequence of how arbitrary keys are turned into indices in an array, a process we call hashing.
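To make the idea of hashing a little more concrete, here is a simplified sketch of how a key could be turned into an array index. The helper `slot_for` and the slot count of 16 are assumptions made for illustration; the real `HashMap` does considerably more (handling collisions and resizing, for example) and uses a randomly seeded hasher by default.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: maps any hashable key to one of `num_slots` slots.
fn slot_for<K: Hash + ?Sized>(key: &K, num_slots: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    // feed the key into the hasher
    key.hash(&mut hasher);
    // reduce the 64-bit hash to a valid index into the array
    (hasher.finish() % num_slots as u64) as usize
}

fn main() {
    for name in ["Sarah", "Tim", "Josh", "Cat"] {
        println!("{name} -> slot {}", slot_for(name, 16));
    }
}
```

The slots that come out bear no relation to the order in which the names were inserted, which is why iterating over a `HashMap` yields its items in an arbitrary order.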
Let's look at another example where `HashMap` is useful: counting occurrences of words. Let's say we have a large text, like this assignment, and we are interested in how many times each word is used. We expect words like 'the', 'and' and 'a' to be very frequent. Let's use this as our setup:

    fn find_occurences(data: &str) {
        // split on words
        for word in data.split(' ') {

        }
    }

    fn main() {
        let words = include_str!("lecture-5-iterators.md");
        find_occurences(words);
    }

At this point, first think about how you would do this using a `HashMap`, without reading ahead. The solution is shown below.

    use std::collections::HashMap;

    fn find_occurences(data: &str) {
        let mut counter: HashMap<&str, u32> = HashMap::new();

        // split on words
        for word in data.split(' ') {
            if counter.contains_key(word) {
                // if the word was seen before, add 1
                // (unwrap can't fail: we just checked that the key exists)
                *counter.get_mut(word).unwrap() += 1;
            } else {
                // if the word wasn't seen before, now we've seen it once
                counter.insert(word, 1);
            }
        }

        println!("{:?}", counter);
    }

    fn main() {
        let words = include_str!("lecture-5-iterators.md");
        find_occurences(words);
    }

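As an alternative to checking `contains_key` first, the standard library's `entry` API can express "insert a starting value if missing, then update" in one step. The sketch below is one possible variant, not the only way to do it; the function name `count_words` and the sample sentence are made up for illustration, and it splits on any whitespace rather than only on single spaces.

```rust
use std::collections::HashMap;

/// Hypothetical helper: counts how often each word occurs in `data`.
fn count_words(data: &str) -> HashMap<&str, u32> {
    let mut counter = HashMap::new();
    for word in data.split_whitespace() {
        // start at 0 if the word is new, then add 1 either way
        *counter.entry(word).or_insert(0) += 1;
    }
    counter
}

fn main() {
    let counts = count_words("the cat and the dog and the bird");
    println!("{:?}", counts.get("the")); // Some(3)
}
```

Returning the map instead of printing it makes the counts usable by the caller, for example to look up a single word as in `main` above.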
Other operations on a `HashMap` include `contains_key` (returns a boolean saying whether a key is present) and `remove` (deletes a key-value pair by its key). For more information, see the standard library documentation.
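A short sketch of these operations in action may help. It also shows what happens when a key is inserted twice: just as in the array example, the new value overwrites the old one. The function name `grades_demo` and the grades themselves are made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical demo: exercises insert, contains_key and remove.
fn grades_demo() -> HashMap<&'static str, f64> {
    let mut grades = HashMap::new();

    // insert returns None when the key is new
    assert_eq!(grades.insert("Sarah", 7.6), None);
    // inserting an existing key overwrites it and returns the old value
    assert_eq!(grades.insert("Sarah", 8.0), Some(7.6));

    grades.insert("Tim", 8.2);
    assert!(grades.contains_key("Tim"));

    // remove returns the removed value, or None if the key was absent
    assert_eq!(grades.remove("Tim"), Some(8.2));
    assert_eq!(grades.remove("Tim"), None);

    grades
}

fn main() {
    let grades = grades_demo();
    println!("{:?}", grades); // {"Sarah": 8.0}
}
```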
HashSet
A `HashSet` is in many ways similar to a `HashMap`. You can think of a `HashSet` as a `HashMap` with no values, only keys. You might think that such a data structure is absolutely useless. However, that's not at all true, because of two properties that `HashMap` and therefore also `HashSet` have:
- A `HashMap`/`HashSet` cannot contain duplicate keys (for a `HashSet`, the keys are its elements).
- You can check whether a `HashMap`/`HashSet` contains a key in constant time, on average.
A `HashSet` can therefore be useful to, for example, find out whether there are duplicates in a list of items, like this:

    use std::collections::HashSet;

    fn has_duplicates(elems: &[u64]) -> bool {
        // make a hashset
        let mut res = HashSet::new();

        // put all elements in
        for i in elems {
            // notice: this is a set so we don't give a value
            res.insert(i);
        }

        // if an element in `elems` was already seen before, it will
        // go at the same location as the previous occurrence in the HashSet
        // thus, the total length of the set will be smaller than the length of `elems`
        res.len() < elems.len()
    }

You can use a very similar program to find the first item you have seen previously. You put every item you see in the set, and `insert` returns `false` when the item was already present: at that point you have found your first duplicate.
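Finding the first repeated item could look roughly like the sketch below. It relies on `HashSet::insert` returning `false` when the value was already in the set; the function name `first_duplicate` is made up for illustration.

```rust
use std::collections::HashSet;

/// Hypothetical helper: returns the first element that was seen before.
fn first_duplicate(elems: &[u64]) -> Option<u64> {
    let mut seen = HashSet::new();
    for &e in elems {
        // insert returns false if `e` was already in the set
        if !seen.insert(e) {
            return Some(e);
        }
    }
    // every element was unique
    None
}

fn main() {
    println!("{:?}", first_duplicate(&[3, 1, 4, 1, 5, 3])); // Some(1)
    println!("{:?}", first_duplicate(&[1, 2, 3])); // None
}
```

Unlike `has_duplicates`, this version can stop as soon as the first duplicate shows up instead of always inserting every element.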
Iterators
The different collections in Rust all have a method called `.iter()`. This method returns an iterator. But what is an iterator?
Simply put, an iterator is any value that may have some kind of next value. For example, you can create an iterator over a `Vec` (through `iter()`) and go from one item to another:

    fn main() {
        // a collection of items
        let items = vec![1, 2, 3, 4, 5];

        // make an iterator
        let mut iterator = items.iter();

        assert_eq!(iterator.next(), Some(&1));
        assert_eq!(iterator.next(), Some(&2));
        assert_eq!(iterator.next(), Some(&3));
        assert_eq!(iterator.next(), Some(&4));
        assert_eq!(iterator.next(), Some(&5));

        // when there are no more elements, return None
        assert_eq!(iterator.next(), None);
    }

Notice that `Vec` is not an iterator itself. We can create an iterator that iterates over a `Vec`, but `Vec` does not have this property of having a next item on its own. When we create an iterator over a `Vec`, we essentially start by pointing to the first element. Every time we call `next`, we advance to the next item, and at some point `next` may return `None`, signalling that there is no next item.
At this point, you may notice that this is not the only possible iterator over a `Vec`. Indeed, to iterate over the `Vec` in reverse, from the end to the start, you could also write:

    fn main() {
        // a collection of items
        let items = vec![1, 2, 3, 4, 5];

        // make an iterator that starts at the end
        let mut iterator = items.iter().rev();

        assert_eq!(iterator.next(), Some(&5));
        assert_eq!(iterator.next(), Some(&4));
        assert_eq!(iterator.next(), Some(&3));
    }

For loops
In many languages, a for loop is an abstraction over simpler loop kinds to iterate a specific number of times. Although this is technically true in Rust too, it might be more accurate to see a for loop as an abstraction over "finishing an iterator". A for loop has the ability to take any iterator, and repeat a block of code until the iterator is exhausted.
Let's take the `Vec` example from above again:

    fn main() {
        // a collection of items
        let items = vec![1, 2, 3, 4, 5];

        // make an iterator
        let iterator = items.iter();

        // run the iterator until the end
        for i in iterator {
            println!("{}", i);
        }
    }

The for loop takes the iterator, and prints every element until the end, where the end is marked by the iterator returning `None`.
Because this is an abstraction, we can look at what the more primitive while-loop version of this code looks like to understand better what is going on:

    fn main() {
        // a collection of items
        let items = vec![1, 2, 3, 4, 5];

        // make an iterator (mutable, because `next` advances it)
        let mut iterator = items.iter();

        // run the iterator until the end
        let mut temp = iterator.next();
        while temp != None {
            // can't fail because we just checked that it's not None
            let i = temp.unwrap();

            // original code in the for loop
            println!("{}", i);

            // advance the iterator
            temp = iterator.next();
        }
    }

So what can we learn from this? One thing is that the for loop you have most likely used the most:

    for i in 0..10 {}

probably also has something to do with iterators. And indeed, it does. The expression `0..10` is actually an iterator. It starts at 0, and every time you call `next` on it, it advances, until after 9 it returns `None`:

    fn main() {
        let mut range = 0..10;

        assert_eq!(range.next(), Some(0));
        assert_eq!(range.next(), Some(1));
        assert_eq!(range.next(), Some(2));
        assert_eq!(range.next(), Some(3));
    }

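To underline that a for loop really accepts anything with a `next` method, here is a sketch of a hand-written iterator. The type `Countdown` is hypothetical and exists only for this illustration; it implements the standard library's `Iterator` trait, which requires nothing more than an item type and a `next` function.

```rust
/// Hypothetical example type: counts down from `remaining` to 1.
struct Countdown {
    remaining: u32,
}

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.remaining == 0 {
            // nothing left: signal the end of the iterator
            None
        } else {
            let current = self.remaining;
            self.remaining -= 1;
            Some(current)
        }
    }
}

fn main() {
    let countdown = Countdown { remaining: 3 };
    // the for loop calls `next` until it returns None
    for i in countdown {
        println!("{i}"); // prints 3, 2, 1
    }
}
```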
Infinite Iterators
The definition of an iterator is very simple. It's anything that has the notion of a `next` element. An iterator signals that it has stopped by returning `None`, but nothing says that it ever has to. An iterator that simply never returns `None` is an infinite iterator, and that's perfectly okay. You could for example write:

    fn main() {
        for i in std::iter::repeat(1) {
            println!("{i}");
        }
    }

This will print the number 1 forever. Most other iterators can also be made infinite by cycling them:

    fn main() {
        for i in vec![1, 2, 3].iter().cycle() {
            println!("{i}");
        }
    }

This will print the numbers 1, 2, 3 and then 1 again, forever. `cycle()` is a method available on any iterator that can be cloned, and methods like it are often called adapters.
Adapters
Adapters are methods on iterators that turn an iterator into something else, usually another iterator. Let's look at an example of another adapter: `map()`. With `map()`, we can transform one type of iterator into another. For example, the following code transforms an iterator over integers into one over floats that are half the value of each integer:

    fn main() {
        let data = vec![1, 2, 3, 4];

        for i in data
            .into_iter()
            .map(|i| i as f64 / 2.0)
        {
            // prints 0.5, 1, 1.5, 2
            println!("{i}");
        }
    }

Sometimes, an adapter can also change the length of an iterator. For example, `take()` will stop the iterator after a number of elements have passed.

    fn main() {
        let data = vec![1, 2, 3, 4];

        for i in data
            .into_iter()
            .take(2)
        {
            // prints 1, 2
            println!("{i}");
        }
    }

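Because `take()` bounds the length of an iterator, it is also a convenient way to get a finite result out of an infinite iterator such as the ones created by `repeat` or `cycle()`. The sketch below assumes a made-up helper `first_cycled` for illustration.

```rust
/// Hypothetical helper: the first `n` items of the endless sequence 1, 2, 3, 1, 2, 3, ...
fn first_cycled(n: usize) -> Vec<i32> {
    vec![1, 2, 3].into_iter().cycle().take(n).collect()
}

fn main() {
    println!("{:?}", first_cycled(7)); // [1, 2, 3, 1, 2, 3, 1]

    // the same trick works for repeat
    let ones: Vec<i32> = std::iter::repeat(1).take(3).collect();
    println!("{:?}", ones); // [1, 1, 1]
}
```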
The most important iterator methods are:
- `.collect()`: Turns an iterator into a data structure (like a `Vec`), consuming all the items in the iterator.
- `.map(|x| some_op(x))`: Transforms all elements of an iterator by applying the function.
- `.filter(|x| some_predicate(x))`: Only keeps the elements in the iterator that satisfy the predicate.
- `.find(|x| some_predicate(x))`: Returns the first item that satisfies the predicate (transforms an iterator into a single value).
- `.take(n)`: Returns a new iterator which is at most n elements long.
- `.position(|x| some_predicate(x))`: Returns the index of the first element that satisfies the predicate.
- `.nth(n)`: Returns the nth element in the iterator, counting from zero (like indexing). Note that `.nth(n)` advances the iterator, so running `.nth(n)` twice won't yield the same element twice.
You can find more methods in the Rust documentation.
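The sketch below tries out several of the methods listed above on a small slice. The helper names `evens` and `first_above` and the sample numbers are made up for illustration.

```rust
/// Hypothetical helper: filter + collect, keeping only the even numbers.
fn evens(data: &[i32]) -> Vec<i32> {
    data.iter().copied().filter(|x| x % 2 == 0).collect()
}

/// Hypothetical helper: index and value of the first element above `limit`.
fn first_above(data: &[i32], limit: i32) -> Option<(usize, i32)> {
    let index = data.iter().position(|x| *x > limit)?;
    let value = data.iter().copied().find(|x| *x > limit)?;
    Some((index, value))
}

fn main() {
    let data = [4, 7, 10, 13, 16];

    println!("{:?}", evens(&data)); // [4, 10, 16]
    println!("{:?}", first_above(&data, 8)); // Some((2, 10))

    // nth advances the iterator, so the second call continues after the first
    let mut it = data.iter();
    assert_eq!(it.nth(1), Some(&7)); // skips 4, returns 7
    assert_eq!(it.nth(1), Some(&13)); // skips 10, returns 13
}
```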
The following is called a closure:

    |a, b, ..., z| { }

A closure is a way to quickly make a function that you can use as an expression, for example as a parameter to another function. The parameters are put within vertical bars (|), and then a block or expression follows with the function body.
A closure can use variables defined in an enclosing scope.
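As a small sketch of both points, the code below stores one closure in a variable and passes another to `map()`; the second closure uses the variable `factor` from its enclosing scope. The helper name `scale_all` is made up for illustration.

```rust
/// Hypothetical helper: multiplies every value by `factor`.
fn scale_all(values: &[i32], factor: i32) -> Vec<i32> {
    // the closure uses `factor`, which is defined outside of it
    values.iter().map(|v| v * factor).collect()
}

fn main() {
    // a closure with two parameters and an expression as its body
    let add = |a: i32, b: i32| a + b;
    println!("{}", add(2, 3)); // 5

    println!("{:?}", scale_all(&[1, 2, 3], 10)); // [10, 20, 30]
}
```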
Many methods on iterators are "lazy". For example, if you run `.map()` on an iterator, you're not actually mapping every element right there. Instead, the calculation is stored, and only performed for items that you're actually using in the program (for example, when you `collect` to put all items in a collection).
For example:

    fn main() {
        // make a very long vec
        let mut v = Vec::new();
        for i in 0..10000 {
            v.push(i);
        }

        // iterate over the vec
        let res: Vec<_> = v.into_iter()
            // only keep items divisible by 3
            .filter(|x| *x % 3 == 0)
            // multiply all of those items by 2
            .map(|x| x * 2)
            // take only the first 20 items
            .take(20)
            // put those in a vector
            .collect();

        println!("{:?}", res);
    }

In this example, an eager system would first apply 10 000 filter operations, and then roughly 3333 map operations, only to throw almost everything away, since we only need 20 items. But because iterators are lazy, the map function actually only runs 20 times. The filter function runs a bit more often, because some items are discarded, but still only 58 times (once for each of the numbers 0 through 57). That saves a lot of computation.
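One way to check these numbers is to count how often each closure runs. The sketch below does so with two counters; the function name `count_calls` is made up for illustration, and it iterates over the range `0..10_000` directly instead of first building a `Vec`.

```rust
/// Hypothetical helper: runs the pipeline and reports how often
/// the filter and map closures were called.
fn count_calls() -> (usize, usize, Vec<i32>) {
    let mut filter_calls = 0;
    let mut map_calls = 0;

    let res: Vec<i32> = (0..10_000)
        .filter(|x| {
            filter_calls += 1;
            x % 3 == 0
        })
        .map(|x| {
            map_calls += 1;
            x * 2
        })
        .take(20)
        .collect();

    (filter_calls, map_calls, res)
}

fn main() {
    let (filter_calls, map_calls, res) = count_calls();
    // prints: filter ran 58 times, map ran 20 times
    println!("filter ran {filter_calls} times, map ran {map_calls} times");
    println!("{} items collected", res.len());
}
```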
In fact, this lazy evaluation is what allows iterators to be infinite. It is perfectly fine to have an iterator over an infinite number of elements, as long as you don't keep all infinity of them in memory, but simply generate the `next()` one every time.

    fn main() {
        let res =
            // create an iterator over all numbers from one to infinity
            // So this iterator is technically infinitely long. It just means
            // it will never return "None" when .next() is called.
            (1..)
            // multiply them all by 2 (this creates the even numbers)
            .map(|i| i * 2)
            // finds the first number that's divisible by 21
            .find(|i| i % 21 == 0);

        // 42 is the first even number divisible by 21
        assert_eq!(res, Some(42));
    }
