Lecture 5: Collections and Iterators

Contents

Collections

The Rust standard library has a module called "collections" that contains a number of commonly used data structures. We have already seen one standard collection, the Vec. In this lecture, we will also look at HashMap and HashSet, two data structures you might find useful.

If you prefer to watch or listen to a more concise treatment of this topic, we highly recommend the video on it by Jon Gjengset.

HashMap

A HashMap is a data structure that works a bit like a database. Elements in a HashMap always have some kind of key, together with an associated item. For example, in a database of students, the key might be your student ID and the value is the information associated with that student ID.

Let's imagine we are designing our own HashMap of keys and values. We might imagine a vector of tuples:

struct Student {
  name: &'static str,
  average_grade: f64,
}

fn main() {
    let mut v = Vec::new();
    v.push((1, Student {name: "Sarah", average_grade: 7.6}));
    v.push((12, Student {name: "Tim", average_grade: 8.2}));
    v.push((34, Student {name: "Josh", average_grade: 5.9}));
    v.push((42, Student {name: "Cat", average_grade: 9.1}));
}

The type of each pair is (i32, Student) here. Let's say that we would like to find the student with ID 42. To do that, we can simply iterate over all students until we find Cat.

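// continuing inside `main` of the previous example, where `v` is defined

// find a value for a specific key
match v.iter().find(|(k, _)| *k == 42) {
    Some((_, student)) => println!("{}", student.name), // prints: Cat
    None => println!("no student with that ID"),
}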

In this case the list of students is not too long, so finding the student with ID 42 won't take too long. But the longer the list becomes, the slower it is to find specific students. Unless we are lucky of course. If we had wanted to find the student with ID 1 in this example, it would be very quick since Sarah happens to be the first student in our list!

An operation like this one, which takes longer the more items we add, is called O(n). That notation essentially says that the time the operation takes grows linearly with the number of items n.

So can we do better? What if we knew that all the keys are numbers from 0 up to (but not including) 100? We could create an array of 100 elements like this:
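To make the linear growth concrete, here is a minimal sketch of the same search written as a helper that also counts how many entries it inspects. The name `find_with_steps` and the step counter are assumptions made for this illustration; they are not part of the standard library.

```rust
/// Searches a list of (key, value) pairs for `key` and also reports how many
/// entries were inspected. (`find_with_steps` is a hypothetical helper.)
fn find_with_steps(pairs: &[(i32, &'static str)], key: i32) -> (Option<&'static str>, usize) {
    let mut steps = 0;
    for (k, v) in pairs {
        steps += 1;
        if *k == key {
            return (Some(*v), steps);
        }
    }
    (None, steps)
}

fn main() {
    let students = [(1, "Sarah"), (12, "Tim"), (34, "Josh"), (42, "Cat")];

    // the first key is found after a single step
    assert_eq!(find_with_steps(&students, 1), (Some("Sarah"), 1));
    // the last key needs one step per entry
    assert_eq!(find_with_steps(&students, 42), (Some("Cat"), 4));
    // a missing key also inspects every entry
    assert_eq!(find_with_steps(&students, 7), (None, 4));
}
```

In the worst case the number of steps equals the length of the list, which is exactly what O(n) expresses.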

#[derive(Debug, Copy, Clone)]
struct Student {
  name: &'static str,
  average_grade: f64,
}

fn main() {
    // every slot starts out empty
    let mut v = [None; 100];

    v[1]  = Some(Student {name: "Sarah", average_grade: 7.6});
    v[12] = Some(Student {name: "Tim", average_grade: 8.2});
    v[34] = Some(Student {name: "Josh", average_grade: 5.9});
    v[42] = Some(Student {name: "Cat", average_grade: 9.1});

    // find the value of a key:
    println!("{:?}", v[14]); // None
    println!("{:?}", v[42]); // Some(Student { name: "Cat", average_grade: 9.1 })
}

On the last line here, we attempted to find a student with ID 42 and instantly found Cat, since their information was stored at index 42. To find that, we did not actually need to search through the array, so the operation becomes what we call constant time or O(1). The speed of this lookup does not at all depend on how many students we store, as long as we know the ID of the student we are looking for.

Notice that we can only have a single student per ID. There is only a single slot for every ID, so if we add a second student with an ID that is already in use, the second student overwrites the first.

The instant-lookup trick with arrays only works when our keys are integers. If we want to find people by their name, we cannot make an array that is indexed by strings. Furthermore, if we don't know a maximum value for these integers, we might need to allocate an enormous amount of memory, which makes the approach infeasible as well.

That is the problem a HashMap solves. It's like an array that can do these instant lookups by key, but unlike the array example it works for almost any key imaginable. Let's look at the example below. Here we use the students' names as keys, not their student numbers.

use std::collections::HashMap;

#[derive(Debug, Copy, Clone)]
struct Student {
  student_id: u64,
  average_grade: f64,
}

fn main() {
  let mut v = HashMap::new();

  v.insert("Sarah", Student { student_id: 1, average_grade: 7.6 });
  v.insert("Tim", Student { student_id: 12, average_grade: 8.2 });
  v.insert("Josh", Student { student_id: 34, average_grade: 5.9 });
  v.insert("Cat", Student { student_id: 42, average_grade: 9.1 });

  // look up a value by its key
  println!("{:?}", v.get("Cat")); // Some(Student { student_id: 42, average_grade: 9.1 })
}

The code looks pretty similar to the array example, but now uses get instead of indexing. With a HashMap, finding an item takes, on average, the same amount of time regardless of the size of the map. Do note that a HashMap does not have a defined order, because of how arbitrary keys are turned into indices in an array (which we call hashing).
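As a rough sketch of what "turning a key into an index" can look like, the snippet below hashes a string with the standard library's DefaultHasher and reduces the result to a bucket number. The helper `bucket_index` and the bucket count of 16 are assumptions for illustration only; the real HashMap uses a more sophisticated internal layout.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a string key to one of `buckets` slots.
/// (`bucket_index` is a hypothetical helper, not how HashMap is implemented.)
fn bucket_index(key: &str, buckets: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    // reduce the 64-bit hash to a valid index
    (hasher.finish() % buckets as u64) as usize
}

fn main() {
    let buckets = 16;
    for name in ["Sarah", "Tim", "Josh", "Cat"] {
        println!("{name} -> bucket {}", bucket_index(name, buckets));
    }

    // the same key always lands in the same bucket
    assert_eq!(bucket_index("Cat", buckets), bucket_index("Cat", buckets));
    // and the index is always within range
    assert!(bucket_index("Cat", buckets) < buckets);
}
```

The bucket numbers have no relation to the order in which the names were listed, which hints at why iteration order over a HashMap is not something to rely on.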

Let's look at another example where HashMap is useful: counting occurrences of words. Let's say we have a large text, like this assignment, and we are interested in how many times each word is used. We expect to find words like 'the' and 'and' and 'a' to be very frequent.

Let's use this as our setup:

fn find_occurences(data: &str) {
  // split on words
  for word in data.split(' ') { 
    
  }
}

fn main() {
  let words = include_str!("lecture-5-iterators.md");
  find_occurences(words);
}

At this point, first think about how you would do this using a `HashMap` before reading ahead. The solution is shown below.

use std::collections::HashMap;

fn find_occurences(data: &str) {
  // maps each word to the number of times we have seen it
  let mut counter: HashMap<&str, u32> = HashMap::new();

  // split on words
  for word in data.split(' ') {
    if let Some(count) = counter.get_mut(word) {
      // if the word was seen before, add 1
      *count += 1;
    } else {
      // if the word wasn't seen before, now we've seen it once
      counter.insert(word, 1);
    }
  }

  println!("{:?}", counter);
}

fn main() {
  let words = include_str!("lecture-5-iterators.md");
  find_occurences(words);
}

Other operations on a HashMap include contains_key (returns a boolean saying whether a key is present) and remove (deletes a key-value pair by its key). For more information, see the standard library documentation.
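A short sketch of these operations in use follows. It also shows the entry method from the standard library, which is a common alternative to the check-then-insert pattern in the counting loop. The function name `count_words` and the sample sentence are made up for this example.

```rust
use std::collections::HashMap;

/// Counts how often each whitespace-separated word occurs.
/// (`count_words` is an illustrative name.)
fn count_words(text: &str) -> HashMap<&str, u32> {
    let mut counter = HashMap::new();
    for word in text.split_whitespace() {
        // entry() looks the key up once; or_insert(0) supplies a starting
        // value when the word has not been seen yet
        *counter.entry(word).or_insert(0) += 1;
    }
    counter
}

fn main() {
    let mut counts = count_words("the cat and the dog and the bird");

    assert_eq!(counts.get("the"), Some(&3));
    assert!(counts.contains_key("cat"));
    assert!(!counts.contains_key("fish"));

    // remove returns the value that was stored, if there was one
    assert_eq!(counts.remove("and"), Some(2));
    assert!(!counts.contains_key("and"));
}
```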

HashSet

A HashSet is in many ways similar to a HashMap. You can think of a HashSet as a HashMap with no values, only keys. You might think that such a data structure is absolutely useless. However, that's not at all true, because of two properties that HashMap, and therefore also HashSet, has.

  1. A HashMap cannot contain duplicate keys, and a HashSet cannot contain duplicate elements.
  2. You can check whether a HashMap/HashSet contains a key or element in constant time (on average).

A HashSet can therefore be useful to, for example, find out whether there are duplicates in a list of items, like this:

use std::collections::HashSet;

fn has_duplicates(elems: &[u64]) -> bool {
    // make a hashset
    let mut res = HashSet::new();

    // put all elements in
    for i in elems {
      // notice: this is a set so we don't give a value
      res.insert(i);
    }

    // if an element in `elems` was already seen before, inserting it again
    // does not add a new entry to the HashSet
    // thus, the total length of the set will be smaller than the length of `elems`
    res.len() < elems.len()
}

You can use a very similar program to find "items you have seen previously". You put every item you see in the set, and as soon as an insert reports that the item was already present (insert returns false in that case), you have found your first duplicate.
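A minimal sketch of that idea is shown here. It relies on the fact that HashSet::insert returns a boolean; the function name `first_duplicate` is an assumption made for this example.

```rust
use std::collections::HashSet;

/// Returns the first element that occurs for a second time, if any.
/// (`first_duplicate` is an illustrative name.)
fn first_duplicate(elems: &[u64]) -> Option<u64> {
    let mut seen = HashSet::new();
    for &e in elems {
        // insert returns false when the value was already in the set
        if !seen.insert(e) {
            return Some(e);
        }
    }
    None
}

fn main() {
    // 1 is the first value that shows up a second time
    assert_eq!(first_duplicate(&[3, 1, 4, 1, 5, 3]), Some(1));
    assert_eq!(first_duplicate(&[1, 2, 3]), None);
}
```

Unlike has_duplicates, this version can stop early, because it does not need to insert the remaining elements once a duplicate is found.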

Iterators

The different collections in Rust all have a method called .iter(). This method returns an iterator. But what is an iterator?

Simply put, an iterator is any value that may have some kind of next value. For example, you can turn a Vec into an iterator (through iter()) and go from one item to another:

fn main() {
  // a collection of items
  let items = vec![1, 2, 3, 4, 5];
  
  // make an iterator
  let mut iterator = items.iter();
  
  assert_eq!(iterator.next(), Some(&1));
  assert_eq!(iterator.next(), Some(&2));
  assert_eq!(iterator.next(), Some(&3));
  assert_eq!(iterator.next(), Some(&4));
  assert_eq!(iterator.next(), Some(&5));
  // when there are no more elements, return None
  assert_eq!(iterator.next(), None);
}

Notice that Vec is not an iterator itself. We can create an iterator that iterates over a Vec, but Vec does not have this property of having a next item on its own. When we create an iterator over a Vec, we essentially start by pointing to the first element. Every time we call next, we advance to the next item, and at some point next may return None signalling that there is no next item.

At this point, you may notice that this is not the only possible iterator over a Vec. Indeed, you could also write:

fn main() {
  // a collection of items
  let items = vec![1, 2, 3, 4, 5];

  // make an iterator
  let mut iterator = items.iter().rev();

  assert_eq!(iterator.next(), Some(&5));
  assert_eq!(iterator.next(), Some(&4));
  assert_eq!(iterator.next(), Some(&3));
}

This iterates over the Vec in reverse, from the end to the start.
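In Rust, "having a next value" is expressed by implementing the standard Iterator trait, which requires a single method, next. As a sketch of what an iterator can look like on the inside, here is a hand-written one. The `Countdown` type is hypothetical and exists only for this illustration.

```rust
/// A hypothetical iterator that counts down from `remaining` to 1.
struct Countdown {
    remaining: u32,
}

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.remaining == 0 {
            // nothing left: signal the end
            None
        } else {
            let current = self.remaining;
            // advance, so the following call yields the next value
            self.remaining -= 1;
            Some(current)
        }
    }
}

fn main() {
    let mut countdown = Countdown { remaining: 3 };

    assert_eq!(countdown.next(), Some(3));
    assert_eq!(countdown.next(), Some(2));
    assert_eq!(countdown.next(), Some(1));
    assert_eq!(countdown.next(), None);
}
```

The only state this iterator keeps is where it currently is, much like an iterator over a Vec keeps track of the element it points to.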

For loops

In many languages, a for loop is an abstraction over simpler loop kinds to iterate a specific number of times. Although this is technically true in Rust too, it might be more accurate to see a for loop as an abstraction over "finishing an iterator". A for loop has the ability to take any iterator, and repeat a block of code until the iterator is exhausted.

Let's take the Vec example from above again:

fn main() {
    // a collection of items
    let items = vec![1, 2, 3, 4, 5];

    // make an iterator
    let iterator = items.iter();

    // run the iterator until the end
    for i in iterator {
        println!("{}", i);
    }
}

The for loop takes the iterator, and prints every element until the end, where the end is marked by the iterator returning None. Because this is an abstraction, we can look at what the more primitive while-loop version of this code looks like to understand better what is going on:

fn main() {
  // a collection of items
  let items = vec![1, 2, 3, 4, 5];

  // make an iterator (mutable, because calling next() advances it)
  let mut iterator = items.iter();

  // run the iterator until the end
  let mut temp = iterator.next();
  while temp != None {
    // can't fail because we just checked that it's not None
    let i = temp.unwrap();
    
    // original code in the for loop
    println!("{}", i);
    
    // advance the iterator
    temp = iterator.next();
  }
}
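The same loop is often written more compactly with while let, which combines the "is it None?" check and the unwrap into one pattern match. The sketch below assumes a helper named `visit_all` (a made-up name) that records every visited item so the result can be checked.

```rust
/// Visits every item using `while let` and returns them in visiting order.
/// (`visit_all` is a hypothetical helper for this example.)
fn visit_all(items: &[i32]) -> Vec<i32> {
    let mut visited = Vec::new();
    let mut iterator = items.iter();

    // keep going for as long as next() returns Some(..)
    while let Some(i) = iterator.next() {
        println!("{}", i);
        visited.push(*i);
    }

    visited
}

fn main() {
    assert_eq!(visit_all(&[1, 2, 3, 4, 5]), vec![1, 2, 3, 4, 5]);
}
```

This is close to what a for loop does for you, which is why the for loop is usually the more readable choice.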

So what can we learn from this? One thing is that the for loop you have probably written most often:

#![allow(unused)]
fn main() {
for i in 0..10 {}
}

probably also has something to do with iterators. And indeed, it does. The expression 0..10 is actually an iterator. It starts at 0, and every time you call next on it, it advances, until it returns None after 9.

fn main(){
    let mut range = 0..10;
    assert_eq!(range.next(), Some(0));
    assert_eq!(range.next(), Some(1));
    assert_eq!(range.next(), Some(2));
    assert_eq!(range.next(), Some(3));
}

Infinite Iterators

The definition of an iterator is very simple. It's anything that has the notion of a next element. Although an iterator can stop whenever it returns None, nothing says that it has to. An iterator that simply never returns None is an infinite iterator, and that's perfectly okay.

You could for example write:

fn main() {
  for i in std::iter::repeat(1) {
    println!("{i}");
  }
}

This will print the number 1 forever. You can also make an existing iterator infinite by cycling it:

fn main() {
  for i in vec![1, 2, 3].iter().cycle() {
    println!("{i}");
  }
}

This will print the numbers 1, 2, 3 and then 1 again, forever. cycle() is a method available on any iterator that can be cloned, and methods like it are often called adapters.
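An infinite loop is awkward to run as an example, so one way to observe cycle() safely is to cap it with take(n), a standard adapter that stops after n elements. The helper name `first_n_cycled` is an assumption made for this sketch.

```rust
/// Repeats `items` over and over and returns the first `n` elements.
/// (`first_n_cycled` is an illustrative name.)
fn first_n_cycled(items: &[i32], n: usize) -> Vec<i32> {
    items
        .iter()
        .cycle()   // start over whenever the end is reached
        .take(n)   // stop after n elements, so this terminates
        .copied()  // turn &i32 into i32
        .collect()
}

fn main() {
    assert_eq!(first_n_cycled(&[1, 2, 3], 7), vec![1, 2, 3, 1, 2, 3, 1]);

    // cycling an empty iterator yields nothing at all
    assert_eq!(first_n_cycled(&[], 5), Vec::<i32>::new());
}
```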

Adapters

Adapters are methods on iterators that turn an iterator into something else. Let's look at an example of another adapter: map(). With map(), we can transform one type of iterator into another. For example, the following code transforms an iterator over integers into one over floats that are half the value of each integer:

fn main() {
    let data = vec![1, 2, 3, 4];
  
    for i in data
            .into_iter()
            .map(|i| i as f64 / 2.0)
    {
        // prints 0.5, 1, 1.5 and 2 (each on its own line)
        println!("{i}");
    }
}

Sometimes, an adapter can also change the length of an iterator. For example, take() will stop the iterator after a number of elements have passed.

fn main() {
  let data = vec![1, 2, 3, 4];

  for i in data
          .into_iter()
          .take(2)
  {
    // prints 1; 2;
    println!("{i}");
  }
}

The most important iterator methods are:

  • .collect(): Turns an iterator into a data structure (like a Vec), consuming all the items in the iterator.
  • .map(|x| some_op(x)): Transforms all elements of an iterator by applying the function.
  • .filter(|x| some_predicate(x)): Only keeps the elements in the iterator that satisfy the predicate.
  • .find(|x| some_predicate(x)): Returns the first item that satisfies the predicate (transforms an iterator into a single value).
  • .take(n): Returns a new iterator which is at most n elements long.
  • .position(|x| some_predicate(x)): Returns the index of the first element that satisfies the predicate.
  • .nth(n): Returns the nth element in the iterator, counting from 0 (like indexing). Note that .nth(n) will advance the iterator. Running .nth(n) twice won't yield the same element twice.

You can find more methods in the Rust documentation.
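The sketch below exercises several of the methods from this list on a small Vec. The helper names `evens`, `first_above` and `index_of` and the sample data are made up for illustration.

```rust
/// Keeps only the even numbers (filter + collect).
fn evens(data: &[i32]) -> Vec<i32> {
    data.iter().copied().filter(|x| x % 2 == 0).collect()
}

/// Returns the first number greater than `limit` (find).
fn first_above(data: &[i32], limit: i32) -> Option<i32> {
    data.iter().copied().find(|x| *x > limit)
}

/// Returns the index of the first occurrence of `target` (position).
fn index_of(data: &[i32], target: i32) -> Option<usize> {
    data.iter().position(|x| *x == target)
}

fn main() {
    let data = vec![4, 8, 15, 16, 23, 42];

    assert_eq!(evens(&data), vec![4, 8, 16, 42]);
    assert_eq!(first_above(&data, 15), Some(16));
    assert_eq!(index_of(&data, 23), Some(4));

    // nth advances the iterator, so calling it twice gives different items
    let mut it = data.iter();
    assert_eq!(it.nth(1), Some(&8));  // skips 4, returns 8
    assert_eq!(it.nth(1), Some(&16)); // skips 15, returns 16
}
```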

The following is called a closure:

|a, b, ..., z| {
    // function body
}

A closure is a way to quickly make a function that you can use as an expression, for example as an argument to another function. The parameters are put between vertical bars (|), followed by a block or expression that forms the function body.

A closure can use variables defined in an enclosing scope.
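As a small sketch of this, both closures below use a variable that is defined outside of them. The names `scale_all`, `factor` and `threshold` are assumptions made for the example.

```rust
/// Multiplies every value by `factor`.
/// (`scale_all` is an illustrative name.)
fn scale_all(values: &[i32], factor: i32) -> Vec<i32> {
    // the closure uses `factor`, which is a parameter of the enclosing function
    values.iter().map(|v| v * factor).collect()
}

fn main() {
    let threshold = 10;

    // `is_large` captures `threshold` from the enclosing scope
    let is_large = |x: i32| x > threshold;

    assert!(is_large(12));
    assert!(!is_large(3));

    assert_eq!(scale_all(&[1, 2, 3], 3), vec![3, 6, 9]);
}
```

A regular fn item could not refer to `threshold` or `factor` like this; it would need them passed in as parameters.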

Many methods on iterators are "lazy". For example, if you run .map() on an iterator, you're not actually mapping every element right there. Instead, the calculation is stored, and only performed for items that you're actually using in the program (for example, when you collect to put all items in a collection).

For example:

#![allow(unused)]
fn main() {
// make a very long vec
let mut v = Vec::new();

for i in 0..10000 {
    v.push(i);
}

// iterate over the vec
let res: Vec<_> = v.into_iter()
    // only keep items divisible by 3
    .filter(|x| *x % 3 == 0)
    // multiply all of those items by 2
    .map(|x| x * 2)
    // take only the first 20 items
    .take(20) 
    // put those in a vector
    .collect();


println!("{:?}", res);
}

In this example, an eager system would first apply 10 000 filter operations, and then roughly 3333 map operations, only to throw almost everything away (we only need 20 items). But because iterators are lazy, the map function actually only runs 20 times, once for each item we need. The filter function runs a bit more often, because some items are discarded, but still only 58 times (for the items 0 through 57). That saves a lot of computation.

In fact, this lazy evaluation is what allows iterators to be infinite. It is perfectly fine to have an iterator over an infinite number of elements, as long as you don't keep all of them in memory, but simply generate the next one each time next() is called.
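One way to check such numbers is to count the closure calls yourself. The sketch below runs the same pipeline over the range 0..10000 and uses Cell counters from the standard library; the function name `count_calls` and the counter names are assumptions made for this illustration.

```rust
use std::cell::Cell;

/// Runs the filter/map/take pipeline and reports how often each closure ran.
/// Returns (filter calls, map calls, collected result).
/// (`count_calls` is a hypothetical helper.)
fn count_calls() -> (usize, usize, Vec<i32>) {
    let filter_calls = Cell::new(0);
    let map_calls = Cell::new(0);

    let res: Vec<i32> = (0..10000)
        .filter(|x| {
            filter_calls.set(filter_calls.get() + 1);
            x % 3 == 0
        })
        .map(|x| {
            map_calls.set(map_calls.get() + 1);
            x * 2
        })
        .take(20)
        .collect();

    (filter_calls.get(), map_calls.get(), res)
}

fn main() {
    let (filter_calls, map_calls, res) = count_calls();

    // the 20th multiple of 3 is 57, so filter looked at 0 through 57
    assert_eq!(filter_calls, 58);
    // map only ran for the 20 items that were actually needed
    assert_eq!(map_calls, 20);
    assert_eq!(res.len(), 20);
    assert_eq!(res[19], 114);
}
```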

#![allow(unused)]
fn main() {
let res =
    // create an iterator over all numbers from one to infinity
    // So this iterator is technically infinitely long. It just means
    // it will never return "None" when .next() is called.
    (1..)
    // multiply them all by 2 (this creates the even numbers)
    .map(|i| i * 2)
    // finds the first number that's divisible by 21
    .find(|i| i % 21 == 0);
    
// 42 is the first even number divisible by 21
assert_eq!(res, Some(42));
}