Concurrency Assignment
Grep
For this assignment, you will be implementing a concurrent version of grep using some of the concurrency primitives you learned about in the lecture. Afterwards, you are going to dive into the inner workings of some of these primitives.
Grep is a command-line utility for searching a file or directory for lines that match a regular expression.
For example, this searches the current directory for files that contain rust
or Rust
.
You can use Grep to search through a directory for specific words in files with the following command:
grep -r "[Rr]ust" .
You may be more familiar with using grep together with pipes. For example:
cat <somefile> | grep <somestring>
This is however not what you will implement in this assignment. Instead, the tool you will make will only be able
to search through directories (recursively). The program will function like grep -r
. This is very similar to
what a well-known tool does called ripgrep. We won't ask you to match
ripgrep's performance, since ripgrep has been optimized a lot. However, what may be interesting to know is that
the author of ripgrep is also the author of the regex library that you will most likely use for this assignment.
Example usage
A simple example can be found in the template in ./examples/example1/
. It contains a single file.
If you run cargo run -- banana ./examples/example1/
, the expected output is:
>>> (#0) ./examples/example1/file.txt
apple banana pear
^^^^^^
For a slightly more complex example, take a look at ./examples/example2/
.
If you run cargo run -- banana ./examples/example2/
, the expected output is:
>>> (#0) ./examples/example2/file1.txt
banana
^^^^^^
>>> (#1) ./examples/example2/dir/file3.txt
banana
^^^^^^
Finally, it is possible to specify multiple directories.
If we specify both examples using cargo run -- banana ./examples/example1/ ./examples/example2/
, the expected output is:
>>> (#0) ./examples/example2/file1.txt
banana
^^^^^^
>>> (#1) ./examples/example1/file.txt
apple banana pear
^^^^^^
>>> (#2) ./examples/example2/dir/file3.txt
banana
^^^^^^
For a big example, if you run the following on the linux source code (use git clone https://github.com/torvalds/linux --depth=1
to clone the repo), you should find 6 results.
The [mM]icrodevice
is the regex that grep should try to match, the .
is the directory it should search in, .
meaning the current directory.
cargo run -- [mM]icrodevice .
Requirements
This assignment consists of two main parts. Some source code you hand in through a GitLab repository you get, as well as a report which should be a PDF you hand in the root of the same git repository. With each part you can gain points, which start counting at 0 and are capped at a 10 which will be your grade. It is required to hand in both source code and a report. Not doing one of the two at all means you fail the assignment.
Source Code
In order to receive up to 4 points for this assignment, the code you hand in must:
- Be able to search a file or directory for all occurrences of a regex pattern.
- This means if a pattern matches multiple times in a single file, you should report all matches
- Print output while it is searching. (It shouldn't buffer the results and print them at the end)
- Print the results only using the provided
Display
implementation. Results should be put intoGrepResult
structs and printed. This will make sure their formatting is consistent. - Make sure the
search_ctr
numbers during printing are in increasing order. - Make sure the output is well-formed.
- Be concurrent, that is, should use multiple threads.
- Be thread safe. Any pattern that may work for a limited input, or which depends on being lucky with how the scheduler schedules your threads is considered incorrect.
- Not use any unsafe code.
- Compile on stable rust (1.73).
- Dynamically determine how many threads it should use based on the number of cores the system has.
- You must not spawn many times more threads than your system has cores. On large folders, you should not spawn millions of threads.
Banned Libraries
You are not allowed to use any of the following libraries:
num_cpus
(this is implemented in std nowadays)walkdir
- Any library that implements the behaviour of
grep
already (if unsure: ask a TA)
To gain additional points, you can do any of the following tasks:
- Be faster than GNU
grep
and show this in your report based on a benchmark: 2 points - Your code does not panic (hint:
Mutex
is special): 0.5 points - Experiment with rayon
- Implement the same code using
rayon
library, and choose this topic for one of your report subjects: 1 point. - You can also only choose
rayon
as your report subject and not implement grep using it. That is still a valid report, but will not get you the 1 extra point of this task
- Implement the same code using
- Experiment with async
- Implement the same code using async rust, and choose this topic for one of your report subjects: 1 point.
- You can also only choose async as your report subject and not implement grep using it. That is still a valid report, but will not get you the 1 extra point of this task
Report
Write a report of at most 2 pages; one per subject and possibly 1 more page if you want to report on the performance of your code as outlined in the task about being faster than GNU grep
. (so max 3 pages in total!)
You should choose two subjects from the following list, one per group member.
Your report can gain you a maximum of 3 points. Note that you do not actually need to implement your code using any of these techniques.
Arc
vsRc
in std: explain the differences, when you would use each, and how each works internally. Talk about the difference between strong and weak reference counts, and whatSend
andSync
mean. Generally show to us that you understand the implementation.Mutex
vsRwLock
in std: explain the differences, in which situations you would use each, and how each works internally. Talk about how aRwLock
is implemented, and what the term "interior mutability" means and what this has to do with aMutex
. Generally show to us that you understand the two.- (
std::sync::mpsc
) Channels: explain how a channel works internally. Talk about the different channel variants, and based on what factors Rust chooses the variant at runtime. This may mean you have to look at the source code implementation of channels. Talk about the advantages of a channel compared to a Mutex. Generally show to us that you understand how channels work. - Atomics and Memory Ordering: Explain how atomics work in hardware on x86_64, and how they relate to the cache of a cpu. Explain what a memory ordering is, and why the different memory orderings are important and how this relates to memory models. Why does
compare_exchange
need two memory orderings? Also demonstrate the use of an atomic variable in yourgrep
assignment. - Rayon vs Advanced Computing Systems: you may notice that the
rayon
library can do many things that the OpenMP library can do which you used during the Advanced Computing Systems course. Report on how rayon schedules tasks, how it is different to OpenMP and what your experience of using it is compared to OpenMP. Also talk about what the advantage is of using rayon over the custom thread-based approach you used for the base implementation of yourgrep
assignment. Note that you can also implementgrep
using rayon for bonus points, see the requirements forgrep
. - Async/Await: Rust has a feature that allows code to asynchronously wait for jobs to complete, moving scheduling from the operating system to user space. Report on how async/await works, how it is different to async/await in other languages like for example Python, C# or JavaScript, and in what situations you might want to use async/await over normal threading. Also discuss what an executor is. Note that you can also implement
grep
using async/await for bonus points, see the requirements forgrep
.