Concurrency Assignment

Grep

For this assignment, you will be implementing a concurrent version of grep using some of the concurrency primitives you learned about in the lecture. Afterwards, you are going to dive into the inner workings of some of these primitives.

Grep is a command-line utility for searching a file or directory for lines that match a regular expression. For example, this searches the current directory for files that contain rust or Rust. You can use Grep to search through a directory for specific words in files with the following command:

grep -r "[Rr]ust" .

You may be more familiar with using grep together with pipes. For example:

cat <somefile> | grep <somestring>

This is however not what you will implement in this assignment. Instead, the tool you will make will only be able to search through directories (recursively). The program will function like grep -r. This is very similar to what a well-known tool does called ripgrep. We won't ask you to match ripgrep's performance, since ripgrep has been optimized a lot. However, what may be interesting to know is that the author of ripgrep is also the author of the regex library that you will most likely use for this assignment.

Example usage

A simple example can be found in the template in ./examples/example1/. It contains a single file. If you run cargo run -- banana ./examples/example1/, the expected output is:

>>> (#0) ./examples/example1/file.txt
apple banana pear
      ^^^^^^

For a slightly more complex example, take a look at ./examples/example2/. If you run cargo run -- banana ./examples/example2/, the expected output is:

>>> (#0) ./examples/example2/file1.txt
banana
^^^^^^

>>> (#1) ./examples/example2/dir/file3.txt
banana
^^^^^^

Finally, it is possible to specify multiple directories. If we specify both examples using cargo run -- banana ./examples/example1/ ./examples/example2/, the expected output is:

>>> (#0) ./examples/example2/file1.txt
banana
^^^^^^

>>> (#1) ./examples/example1/file.txt
apple banana pear
      ^^^^^^

>>> (#2) ./examples/example2/dir/file3.txt
banana
^^^^^^

For a big example, if you run the following on the linux source code (use git clone https://github.com/torvalds/linux --depth=1 to clone the repo), you should find 6 results. The [mM]icrodevice is the regex that grep should try to match, the . is the directory it should search in, . meaning the current directory.

cargo run -- [mM]icrodevice .

Learning Objectives

In this assignment, the goal is to learn about:

  1. Implementing concurrency using Rust primitives
  2. Understanding and utilizing atomic operations
  3. Exploring Rust’s threading and parallelism libraries
  4. Optimizing concurrent code for performance

Requirements

This assignment consists of two main parts.

  1. Source code you hand in through a GitLab repository you get (7 points)
  2. A report which should be a PDF you hand in the root of the same git repository (3 points)

With each part you can gain points, which start counting at 0 and are capped at a 10 which will be your grade. It is required to hand in both source code and a report. Not doing one of the two at all means you fail the assignment.

Source Code: 7 points

In order to receive up to 4 points for this assignment, the code you hand in must:

  • Be able to search a file or directory for all occurrences of a regex pattern.
    • This means if a pattern matches multiple times in a single file, you must report all matches
  • Print output while it is searching. (It shouldn't buffer the results and print them at the end)
  • Print the results only using the provided Display implementation. Results should be put into GrepResult structs and printed. This will make sure their formatting is consistent.
  • Make sure the search_ctr numbers during printing are in increasing order.
  • Make sure the output is well-formed.
  • Be concurrent, that is, should use multiple threads.
  • Be thread safe. Any pattern that may work for a limited input, or which depends on being lucky with how the scheduler schedules your threads is considered incorrect.
  • Not use any unsafe code.
  • Compile on stable rust (1.81).
  • Dynamically determine how many threads it should use based on the number of cores the system has.
    • You must not spawn many times more threads than your system has cores. On large folders, you should not spawn millions of threads.

Banned Libraries

You are not allowed to use any of the following libraries:

  • num_cpus (this is implemented in std nowadays)
  • walkdir
  • Any library that implements the behaviour of grep already (if unsure: ask a TA)

To gain additional points, you can do any of the following tasks:

  • Be faster than GNU grep and show this in your report based on a benchmark: 2 points
  • Your code does not panic (hint: Mutex is special): 0.5 points
  • Experiment with rayon: 1 point
    • Implement the same code using rayon library, and choose this topic for one of your report subjects
  • Experiment with async: 1 point
    • Implement the same code using async rust, and choose this topic for one of your report subjects

Rayon and Async Rust

You can also choose rayon or async rust as your report subject and not implement grep using it. That is still a valid report, but will not get you the 1 extra point for the source code section

Report: 3 points

Write a report consisting of the following

  • Write up on two topics (one per team member) of your choice from the list below: 2 points
  • performance report with benchmark if your implementation is faster than GNU grep 1 point

The report must not exceed 3 pages:

  • 1 page maximum for each topic
  • Additional page for the performance report

You should choose two subjects from the following list, one per group member. Note that you do not actually need to implement your code using any of these techniques.

A report that aims to get a 10 should, both in style and in citation, try to emulate Stacked borrows: an aliasing model Rust

What to write?

The goal of the report is to showcase your grasp of a concept. Rather than demonstrating that you’ve understood the documentation, aim to go beyond it in your investigation. The expectation is for the topics to be explored, analyzed, and deeply understood. Teach us something new—something we might not know! Explore various resources such as papers, blog posts, and videos (We expect you to cite at least one reputable source). The idea is to craft each topic write-up like a small-scale exploratory paper. You could also think of creative ways to apply the concept differently than usual. It’s up to you—make it fun, make it interesting!

  • Arc vs Rc in std: explain the differences, when you would use each, and how each works internally. Talk about the difference between strong and weak reference counts, and what Send and Sync mean. Generally show to us that you understand the implementation.
  • Mutex vs RwLock in std: explain the differences, in which situations you would use each, and how each works internally. Talk about how a RwLock is implemented, and what the term "interior mutability" means and what this has to do with a Mutex. Generally show to us that you understand the two.
  • (std::sync::mpsc) Channels: explain how a channel works internally. Talk about the different channel variants, and based on what factors Rust chooses the variant at runtime. This may mean you have to look at the source code implementation of channels. Talk about the advantages of a channel compared to a Mutex. Generally show to us that you understand how channels work.
  • Atomics and Memory Ordering: Explain how atomics work in hardware on x86_64, and how they relate to the cache of a cpu. Explain what a memory ordering is, and why the different memory orderings are important and how this relates to memory models. Why does compare_exchange need two memory orderings? Also demonstrate the use of an atomic variable in your grep assignment.
  • Rayon vs Advanced Computing Systems: you may notice that the rayon library can do many things that the OpenMP library can do which you used during the Advanced Computing Systems course. Report on how rayon schedules tasks, how it is different to OpenMP and what your experience of using it is compared to OpenMP. Also talk about what the advantage is of using rayon over the custom thread-based approach you used for the base implementation of your grep assignment. Note that you can also implement grep using rayon for bonus points, see the requirements for grep.
  • Async/Await: Rust has a feature that allows code to asynchronously wait for jobs to complete, moving scheduling from the operating system to user space. Report on how async/await works, how it is different to async/await in other languages like for example Python, C# or JavaScript, and in what situations you might want to use async/await over normal threading. Also discuss what an executor is. Note that you can also implement grep using async/await for bonus points, see the requirements for grep.