Concurrency Assignment

Grep

For this assignment, you will be implementing a concurrent version of grep using some of the concurrency primitives you learned about in the lecture. Afterwards, you are going to dive into the inner workings of some of these primitives.

Grep is a command-line utility for searching a file or directory for lines that match a regular expression. For example, this searches the current directory for files that contain rust or Rust. You can use Grep to search through a directory for specific words in files with the following command:

grep -r "[Rr]ust" .

You may be more familiar with using grep together with pipes. For example:

cat <somefile> | grep <somestring>

This is however not what you will implement in this assignment. Instead, the tool you will make will only be able to search through directories (recursively). The program will function like grep -r. This is very similar to what a well-known tool does called ripgrep. We won't ask you to match ripgrep's performance, since ripgrep has been optimized a lot. However, what may be interesting to know is that the author of ripgrep is also the author of the regex library that you will most likely use for this assignment.

Example usage

A simple example can be found in the template in ./examples/example1/. It contains a single file. If you run cargo run -- banana ./examples/example1/, the expected output is:

>>> (#0) ./examples/example1/file.txt
apple banana pear
      ^^^^^^

For a slightly more complex example, take a look at ./examples/example2/. If you run cargo run -- banana ./examples/example2/, the expected output is:

>>> (#0) ./examples/example2/file1.txt
banana
^^^^^^

>>> (#1) ./examples/example2/dir/file3.txt
banana
^^^^^^

Finally, it is possible to specify multiple directories. If we specify both examples using cargo run -- banana ./examples/example1/ ./examples/example2/, the expected output is:

>>> (#0) ./examples/example2/file1.txt
banana
^^^^^^

>>> (#1) ./examples/example1/file.txt
apple banana pear
      ^^^^^^

>>> (#2) ./examples/example2/dir/file3.txt
banana
^^^^^^

For a big example, if you run the following on the linux source code (use git clone https://github.com/torvalds/linux --depth=1 to clone the repo), you should find 6 results. The [mM]icrodevice is the regex that grep should try to match, the . is the directory it should search in, . meaning the current directory.

cargo run -- [mM]icrodevice .

Requirements

This assignment consists of two main parts. Some source code you hand in through a GitLab repository you get, as well as a report which should be a PDF you hand in the root of the same git repository. With each part you can gain points, which start counting at 0 and are capped at a 10 which will be your grade. It is required to hand in both source code and a report. Not doing one of the two at all means you fail the assignment.

Source Code

In order to receive up to 4 points for this assignment, the code you hand in must:

  • Be able to search a file or directory for all occurrences of a regex pattern.
    • This means if a pattern matches multiple times in a single file, you should report all matches
  • Print output while it is searching. (It shouldn't buffer the results and print them at the end)
  • Print the results only using the provided Display implementation. Results should be put into GrepResult structs and printed. This will make sure their formatting is consistent.
  • Make sure the search_ctr numbers during printing are in increasing order.
  • Make sure the output is well-formed.
  • Be concurrent, that is, should use multiple threads.
  • Be thread safe. Any pattern that may work for a limited input, or which depends on being lucky with how the scheduler schedules your threads is considered incorrect.
  • Not use any unsafe code.
  • Compile on stable rust (1.73).
  • Dynamically determine how many threads it should use based on the number of cores the system has.
    • You must not spawn many times more threads than your system has cores. On large folders, you should not spawn millions of threads.

Banned Libraries

You are not allowed to use any of the following libraries:

  • num_cpus (this is implemented in std nowadays)
  • walkdir
  • Any library that implements the behaviour of grep already (if unsure: ask a TA)

To gain additional points, you can do any of the following tasks:

  • Be faster than GNU grep and show this in your report based on a benchmark: 2 points
  • Your code does not panic (hint: Mutex is special): 0.5 points
  • Experiment with rayon
    • Implement the same code using rayon library, and choose this topic for one of your report subjects: 1 point.
    • You can also only choose rayon as your report subject and not implement grep using it. That is still a valid report, but will not get you the 1 extra point of this task
  • Experiment with async
    • Implement the same code using async rust, and choose this topic for one of your report subjects: 1 point.
    • You can also only choose async as your report subject and not implement grep using it. That is still a valid report, but will not get you the 1 extra point of this task

Report

Write a report of at most 2 pages; one per subject and possibly 1 more page if you want to report on the performance of your code as outlined in the task about being faster than GNU grep. (so max 3 pages in total!) You should choose two subjects from the following list, one per group member. Your report can gain you a maximum of 3 points. Note that you do not actually need to implement your code using any of these techniques.

  • Arc vs Rc in std: explain the differences, when you would use each, and how each works internally. Talk about the difference between strong and weak reference counts, and what Send and Sync mean. Generally show to us that you understand the implementation.
  • Mutex vs RwLock in std: explain the differences, in which situations you would use each, and how each works internally. Talk about how a RwLock is implemented, and what the term "interior mutability" means and what this has to do with a Mutex. Generally show to us that you understand the two.
  • (std::sync::mpsc) Channels: explain how a channel works internally. Talk about the different channel variants, and based on what factors Rust chooses the variant at runtime. This may mean you have to look at the source code implementation of channels. Talk about the advantages of a channel compared to a Mutex. Generally show to us that you understand how channels work.
  • Atomics and Memory Ordering: Explain how atomics work in hardware on x86_64, and how they relate to the cache of a cpu. Explain what a memory ordering is, and why the different memory orderings are important and how this relates to memory models. Why does compare_exchange need two memory orderings? Also demonstrate the use of an atomic variable in your grep assignment.
  • Rayon vs Advanced Computing Systems: you may notice that the rayon library can do many things that the OpenMP library can do which you used during the Advanced Computing Systems course. Report on how rayon schedules tasks, how it is different to OpenMP and what your experience of using it is compared to OpenMP. Also talk about what the advantage is of using rayon over the custom thread-based approach you used for the base implementation of your grep assignment. Note that you can also implement grep using rayon for bonus points, see the requirements for grep.
  • Async/Await: Rust has a feature that allows code to asynchronously wait for jobs to complete, moving scheduling from the operating system to user space. Report on how async/await works, how it is different to async/await in other languages like for example Python, C# or JavaScript, and in what situations you might want to use async/await over normal threading. Also discuss what an executor is. Note that you can also implement grep using async/await for bonus points, see the requirements for grep.