代写COMP26120、代做C++, Java/Python编程
- 首页 >> C/C++编程 Academic Session: 2023-24
Lab Exercise 3: Spellchecking (Better Trade-offs)
Duration: 4 weeks
You should do all your work in the lab3 directory of the COMP26120 2023 repository - see Blackboard
for further details. You will need to make use of the existing code in the branch as a starting point.
Important: You submit this lab via a quiz on Blackboard. This will:
1. Ask you some questions about your implementation, including the hash of the commit you want
us to mark (see below for details of this).
2. Ask you to upload a zip file of the python, c or java folder you have been working in.
3. Ask you to upload a PDF report of the experiments you will run in Part C of this lab.
4. Ask you for a statement on whether and (if relevant) how you used generative AI tools in your
work. If you used generative AI you also have to upload a file containing transcripts of your
usage.
You can save your answers and submit later so we recommend filling in the questions for each part as
you complete it, rather than entering everything at once at the end.
• You MUST fill in and submit the quiz (at least some of it) to get marked.
• Reports MUST be uploaded to Blackboard if you want them marked.
• Code you want marked MUST be in the commit hash you give us in the Blackboard
quiz.
We will not mark anything pushed to GitLab if we don’t have the commit hash we can
identify from Blackboard. Submission time is recorded as the time the Blackboard quiz
is submitted, not the time you last pushed to GitLab or the last time you saved the quiz.
Code Submission Details
You have the choice to complete the lab in C, Java or Python. Program stubs for this exercise exist
for each language. Only one language solution will be marked.
Because people have had a number of issues with GitLab over the years we take a multiple redundancy approach to submission of code. This involves both pushing a commit to GitLab and uploading
a zip of your code to Blackboard. By preference we will mark the code you submitted to GitLab but
1
Figure 1: Identifying the hash of your most recent commit in GitLab
Figure 2: Identifying the hashes of previous commits in GitLab
if we can’t find it or it doesn’t check out properly then we will look in the zip file on Blackboard.
Please do both to maximise the chance that one of them will work.
When you submit the assignment through Blackboard you will be asked for the hash of the commit
you want marked. This is to make sure the TAs can identify exactly which GitLab commit you want
marked.
You can find the hash of your most recent commit by looking in your repository on GitLab as
shown in figure 1.
You can also find the hash for a previous commit by clicking on the “commits” link and then
identifying the commit you are interested in. This is shown in figure 2.
Note that while the full hash for commits are quite long, we only need the first 8 characters (as
shown in the screenshots) to identify for marking.
For this assignment, you may use built-in libraries in your chosen language but bear in
mind that we ask you to implement hash sets - simply using the Java HashSet class (for
instance) will not get you the marks for implementation.
Reminder: It is bad practice to include automatically generated files in source control (e.g. your git
repositories). This applies to object files (C), class files (Java), and compiled bytecode files (Python).
2
It’s not fatal if you do this but it can cause issues during marking so try not to.
While it is fine to discuss this coursework with your friends and compare notes, the work submitted
should be your own. In particular this means you should not have copied any of the source code,
or the report. We will be using the turnitin tool to compare reports for similarities.
Generative AI You may use generative AI for this assignment. If you choose to do so you should
read the guidance notes and watch the guidance video in the Blackboard folder for the lab. At the
start of the quiz you are asked to make a statement about your use of generative AI. If you leave
this blank it will be understood as a declaration that you have not used generative AI
for the coursework. The statement should make the following things clear:
1. Which generative AI tool you used.
2. What parts of the coursework you used it for: if you used it for the code was it for all the code
or only some of it? If just some, which bits? If you used it to help you answer quiz questions,
which quiz questions? if you used it to help with the report, which parts of the report?
3. What steps you took to ensure that the generative AI tool had provided correct information.
4. What evidence there is either in the statement, or in your coursework submission, that demonstrates you have met the coursework learning outcomes.
If you used generative AI you should also upload a zip file containing the transcripts of your usage in
question 2 of the Blackboard quiz.
3
Learning Objectives
By the end of this lab you will be able to:
• Explain the role of the hash function in the implementation of hash tables and describe at least
one hash collision strategy.
• Explain the basic structure of a binary search tree and the basic operations on this structure.
• Produce C, Java, or Python code to implement the above concepts.
• Evaluate experimentally the behaviour of hash sets and binary search trees.
Introduction
The aim of this exercise is to use the context of a simple spell-checking program to explore the hash
table and binary search tree data structures.
This exercise builds on work in Labs 1 and 2. If you have not done these labs we recommend you
take a look at them and their model solutions in order to familiarise yourself with the spell-checking
program and our expectations for creating and writing up experiments.
Getting the Support Code
You should be able to see a lab3 folder in your Gitlab that contains all the support files for this lab.
You can find instructions for accessing your gitlab on Blackboard.
Data Structures
The spell-checking program stores a dictionary of words using a Set datatype. There are two data
structures used to implement this Set datatype in this lab. In Lab 1 we introduced this datatype
but we repeat that information here. You may also need to look online (e.g. the Wikipedia page) to
complete your knowledge.
Set. This is an abstract datatype that is used to store the dictionary of words in our spell-checking
program. The operations required for this application are:
• find whether a given value (i.e. string, alphabetic word) appears in the set.
• insert a given value in the set. Note that there is no notion of multiplicity in sets - a value
either appears or it does not. Therefore, if insert is called with a duplicate value (i.e. it is
already in the set) it can be ignored.
There would usually also be a remove function but this is not required for this application. We also
include two further utility functions, which we ask you to implement, that are useful for printing what
is happening :
• print set to list the contents of a Set.
• print stats to output statistical information to show how well your code is working.
You should have already used a dynamic array to implement this dataype in Lab 1.
In this exercise we use hash tables and binary search trees to implement the Set datatype. These
have been introduced in lectures and we describe these briefly below. You may want to look at the
recommended textbooks or online to complete your knowledge.
4
Hash Table. The underlying memory structure for a hash table is an array. A hash function is
then used to map from a large ‘key’ domain to the (much smaller) set of indices into the array, where
a value can be stored. When two values in the input domain map to the same index in the array this
is called a collision and there are multiple ways to resolve these collisions. To use a hash table to
represent a set we make the key and value the same - the result is usually called a hash set.
Binary Search Tree. You can think of a tree as a linked list where every node has two ‘children’.
This allows us to differentiate between what happens in each sub-tree. In a binary search tree the
general idea is that the ‘left’ sub-tree holds values smaller than that found in the current node, and
the ‘right’ sub-tree holds larger values.
Lab 3: Better Storage
In Lab 1 we achieved a faster find function by first sorting the input. In this part we explore two data
structures that organise the data in a way that should make it faster to perform the find operation.
Part a: Hash Table Lookup
Time Budget: Students generally find Part a of this lab more challenging than Part b. If you have
not attempted Lab 1, then you should budget around 1 hour for familiarising yourself with the code
stubs we provide, the spell-checking program, command line flags, test harness and set data-type
implementation. You might want to look at the sample answers to Lab 1 to assist with this. You
should then budget 2-3 hours for the implementation itself. If you have spent more than 4 hours
on Part a, we recommend skipping ahead to Part b and only returning to Part a if you have time
remaining.
So far our solution to a fast find function has been sorting the input and in Part b we will look at
storing it in a sorted order. In this part you will take a different approach that relies on a hash
function to distribute the values to store into a fixed size array.
In this part you need to edit the program stub in your chosen language for hash sets. You should:
1. Complete the insert function for hash sets. What should you do if the value to be inserted
already exists? Hint: this is representing a set.
2. Complete the find function. Importantly, it should follow the same logic as insertion.
3. Implement the print set function – This should print out a sensible representation of the hash
set when called with the −vvv flag.
4. Implement the print stats function – This should print out a sensible set of statistics. This
should include the total number of collisions (including those incurred when re-inserting after a
rehash), total number of rehashes and the average number of collisions per access. print stats is
automatically called by the spell-checking program on completion. We will look at this output
so do not change or disable it. This will require some extra fields in the class (Java, Python) or
node struct (C).
You can find a discussion of how to test your code in the section of these instructions on Testing (after
Part b).
The hash-value(s) needed for inserting a string into your hash-table should be derived from the
string. For example, you can consider a simple summation key based on ASCII values of the individual characters, or a more sophisticated polynomial hash code, in which different letter positions are
associated with different weights. (Warning: if your algorithm is too simple, you may find
that your program is very slow when using a realistic dictionary and causes issues when
5
you perform your experimental evaluation.)
We want you to implement an open addressing hashing strategy so that collisions are dealt
with by using a collision resolution function. The simplest of these is linear probing, the most
straightforward strategy but you could also choose to implement quadratic probing or double hashing. To understand what is going on you should add code to count the number of collisions
so you can calculate the average per access in print stats. You are welcome to implement whatever
open addressing strategy you like.
The program stubs allow you to implement several collision strategies and hash functions which
can be accessed by modes already given in the stubs. The implementation you want us to mark
should run in mode 0. You may find this useful if you are attempting the extension exercise in the
report. More details of these are contained in the language specific instructions. Do not change these
since they are used in our testing processes. Write your code so that you can use the −m parameter,
which sets the mode.
An issue with open addressing is what to do when it is not possible to insert a value. To make things
simple to begin with you can simply fail in such cases. Your code should keep the num entries field
of the table up to date. Your insert function should check for a full table, and exit the program
cleanly should this occur. Once this is working you should increase (double?) the hash table size
when the table is getting full and then rehash into a larger table. Implement this resize and
rehash functionality. When the program is called using the −vvv flag it should print out
a message when it rehashes and then call the print set function to show the new state of
the hash set once rehashing completes. In C, beware of memory leaks!1
Mod of negative numbers:
Most of you are likely to use the mod operator in your hash functions by doing something
like
h = x % size
to try to get a value of h that is between 0 and size - 1.
If x is negative then the behaviour of this operator varies by language: in particular in C and Java it will return a negative number. See the discussion at
https://torstencurdt.com/tech/posts/modulo-of-negative-numbers/
You might think that x cannot be negative. However, if you are adding or multiplying large
numbers together (e.g., when hashing a string) then you are likely to get integer overflow.
In hashing, we don’t care about such overflows as it is deterministic and just adds to the
‘chaos’ a bit but Java, for instance, may throw ArrayIndexOutOfBoundsException if a
negative number is passed to the hash data structure, if you have not accounted for this
possibility in your implementation.
In C you could use an unsigned integer to make sure that you only have positive numbers.
Once you have completed this implementation you should look at the Blackboard submission form
where you will be asked the following questions about Part a.
1. What did you implement as your hash function? How does it work? How “good” is it? (No
more than 100 words).
1One can also get memory leaks in memory managed languages but these are more semantic e.g. preserving a
reference to something you will never use again.
6
2. How do find and insert for hash sets work? How have you implemented them? You do not need
to discuss the details of your collision resolution strategy here (this is the next question) but
should cover the basics of how a word is inserted into the dictionary and how a word is found
(or not) in the dictionary. (No more than 100 words, not counting any code snippets included).
3. What is your collision resolution strategy? How does it work? How have you implemented
it? Illustrate your explanation with a sample of your code - this may be tided up a bit for
presentation but should clearly be the same as the code you have submitted (No more than 200
words, not counting the code snippet).
4. Did you implement rehashing? If so briefly explain your implementation including explaining
when you choose to rehash. (No more than 200 words, not counting any code snippet included).
Part b: Binary Search Tree Lookup
Time Budget: This part of the lab should take you about 2 hours to complete. If it is taking you
more than 3 hours and you have completed a basic hash set implementation for Part a you might
want to skip forward and do the first section of Part c (which only needs the hash set implementation)
before returning to this. If you have not completed any of Part a it is worth persevering with this to
get it working before moving on to Part c where you should focus on those sections relating to binary
trees.
In this part you need to edit the program stub in your chosen language for binary search trees.
You should:
1. Complete the insert function by comparing the value to insert with the existing value.
2. Complete the find function. Importantly, it should follow the same logic as insertion.
3. Implement the print set function – This should print out a sensible representation of the hash
set when called with the −vvv flag.
4. Implement the print stats function – This should print out a sensible set of statistics, including
the height of the tree and average number of comparisons per insert and find. print stats is
automatically called by the spell-checking program on completion. We will look at this output
so do not change or disable it.
Once you have completed this implementation you should look at the Blackboard submission form
where you will be asked the following question about Part b.
1. How do find and insert for binary trees work? How have you implemented them? Include your
code for insert and find and relate your explanation to this code. (No more than 200 words, not
including the code snippet(s))
In this lab exercise we will stop there with trees. However, it is worth noting that this implementation
is not optimal. What will happen if the input dictionary is already sorted? In the next lab we will be
exploring self-balancing trees.
Testing
There are instructions in the individual language sections explaining how to test your algorithm on a
single example and you should try that first. Once that is done you can test them on some test suites.
We have provided a test script, test.py in the top level directory of the COMP26120 2023 repository and some data in the data directory. test.py takes up to five arguments: the first is the language
you are using; the second is the test suite (simple, large and collision tests are provided but
you can create more); the third is the data structure (bstree or hashset) you are testing; the fourth
7
(optional, defaults to 0) is the mode you want to test; and the fifth (optional) is the initial hash set
size (only applicable for the hashset).
For example, to run the simple tests for mode 2, your language is Java and you want to test your
binary tree implementation, you should run
./python3 test.py java simple bstree 2
You might want to edit the script to do different things but we will use the one provided when marking
your code so make sure it runs with this. We have also provided a large test (replace simple by
large above) which will take a lot longer to run (see help box below). You will need to unzip the
files by running unzip henry.zip in the data/large directory. This should be particularly useful to
check if you have implemented rehashing correctly.
Finally we have provided some collision test tests. These are dictionaries containing exactly
five items. If you run them with an initial hash set size of 5 it is highly likely that at least one insert
will result in a collision and this will allow you to check if your collision resolution mechanism is
working correctly. These can also be used to test if your rehashing implementation is working.
For example, to run the collision tests for mode 0 with python as the language, you should run
./python3 test.py python collision_tests hashset 0 5
This will run your hash set implementation on these tests with an initial hash set size of 5.
In general, to be sure that you collision resolution strategy is working, we would recommend adding
a message that gets printed out when run in verbose mode (e.g., with the -vv flag) when a collision
occurs, possibly including printing out the set after the element is inserted, and then running tests
manually using the instructions in the language specific sections to make sure at least one of your
tests is encountering a collision and it is handled correctly when it does.
Failing Tests:
If the tests are failing there are two possible explanations. Either your implementation is
wrong or the test script is being more picky than you thought.
For the first reason, make sure you understand what you expect to happen. Create a
small dictionary and input file yourself and test using these or use one of the individual
simple tests. There are nine of these and you will find them organised in the data/simple
directory where each sub-directory contains an example dictionary, input file and the
answer expected by the test. Most of these are small and easy to understand. Remember
that the program should print out all of the words that are in the input file but not in the
dictionary. Make sure that your code is connected up properly in find so that it returns
true/false correctly.
For the second reason, make sure that you don’t print anything extra at verbosity level 0. If
you look at test.py you’ll see that what it’s doing is comparing the output of your program
between “Spellchecking:” and “Usage statistics:” against some expected output. It runs the
program at verbosity level 0 and expects that only spelling errors are output in-between
those two points.
As part of marking, we will run your implementations on the simple and collision tests test
suites and may also run them on the large test suite if we feel additional testing is required.
Part c: Making a Comparison
Time Budget: If you have not attempted Lab 2 then you should budget 1 hour to familiarise
yourself with the concept of running experiments, generating data and so on. You should then expect
8
this part of the lab should take you 2-3 hours to complete. However, actually generating data and
running experiments is likely to take more time than that, but you can leave your computer running
and do something else while this is happening – particularly if you write scripts to automate data
generation and run experiments. You should plan your time with this in mind. We recommend writing
up the experiments as you go so you have something to submit if you run out of time.
You should now perform an experimental evaluation like the one you did for Lab 2. We want to
address the question.
Under what conditions are your different implementations of the dictionary data structure
preferable?
Note that for hash sets the initial size of the data structure will potentially play a role. You may
want to design your experiment so this is not a concern. You may also find it useful to correlate your
findings with statistics such as the number of collisions in the hash table.
We recommend you read, if you have not already done so, the Lab 2 instructions on how to generate
inputs for your experiments. You may also (if possible) reuse the inputs you generated as part of that
lab as part of your evaluation here in order to save time. Note that generation of inputs is likely to
take some time (hours for large dictionaries and queries), so you should plan your work so you can
set a generation script running and then leave it while you do something else.
You should perform two experiments for your evaluation and then draw a conclusion that answers
the question above. You can then devise and run a third optional (set of) experiments as an extension
exercise.
The two experiments you should perform should answer the questions:
1. Does your implementation of insert for hash sets work as expected in the average case in terms
of complexity.
2. Does your implementation of insert for binary trees work as expected in the average case in
terms of complexity.
Important: In your hypothesis, experimental design and results you should be very careful to
consider whether you are discussing or measuring a single insertion, or several insertions and you
should be consistent around this throughout. Of course, you can measure several insertions and from
that calculate the cost of a single insertion, but you should be clear about this.
Your presentation of each experiment should have four sections – these are likely to be very similar
for each experiment:
Hypothesis You should state, as a hypothesis, what you expect the performance to be in big O
notation and then write a short paragraph explaining why you believe this to be the case, based
on the theoretical complexities of insert for the data structure. Be careful about how you present
big O notation. In particular O is not a function so it doesn’t make sense to perform arithmetic
with it such as
O(n)
n
or
O(n) × n
(This section should be about 200 words and take up no more than half a page in your report).
Experimental Design You should describe how you designed the experiment. This should include:
• Your intended independent variable – this is the thing you are varying.
• Your intended dependent variable – this is the thing you are measuring whose value your
hypothesis says depends upon the value of the independent variable.
9
• Anything that you could vary but you are not – for instance to avoid confusing your results
with other variables that might also influence performance.
• What input data you generated and how many input files you used for each value of your
independent variable.
• What program you ran on what inputs. How many times you ran it on each input.
• What you measured and how.
You should also mention anything else you think is relevant that will help your marker judge
how well you designed your experiment to test your hypothesis. (This section should be about
500 words and take up no more than a page in your report – not counting any scripts included
as appendices).
Results You should present your results, ideally as a graph showing a best fit line. If you do any
processing on results, such as generating best fit lines, computing averages, etc., then these
should be described. Raw data should be presented in an appendix or included with your code
submission. You should state clearly whether your hypothesis was confirmed or refuted (or it is
equivocal or difficult to tell). If your hypothesis was refuted you should discuss why that might
be (e.g., incorrect implementation of insert, some problem with the experimental design) and
sketch how you might investigate further. (This section should be about 200 words and take up
no more than half a page in your report – not counting graphs and tables).
Note that it is more important to be able to clearly explain your experimental design (justifying
your decisions) than it is to have excellent results. The answers to the wrong question are useless.
Top marks can be obtained for Part c for a well-designed experiment that has disappointing results
(either disproving a hypothesis or just being unclear in what they show), so long as the conclusions
you draw make it clear you are aware of the issues with the results.
Disappointing/Surprising Results:
While you can get full marks for results that disprove your hypothesis, these often are
indicative of issues with the implementation or, if you are using them, with any scripts
written to automate the experiments. Therefore it is alway worthwhile double-checking these
things since problems with these will effect your marks. The commonest implementation
issue we see here is that find and/or insert are implemented in a way that returns the correct
answer but does more work than necessary. The commonest issue with automation scripts
is that they are not running the program at all, trying to run it on non-existent or empty
input files, or causing it to crash somehow. If your results seem to show that your program
takes the same amount of time on all inputs (particularly if that time is very short) then
it is worth double-checking that your automation script is actually running the program
correctly.
Once you have written up your experiments you should write two further sections.
Conclusion Use your results to address the initial question: when should you use your hash set
implementation and when should you use your binary tree implementation? You do not need to
validate this experimentally since for some implementations this is likely to take a lot of time,
even with a good implementation on a fast machine. Theoretically, there is a crossover point
where one becomes better than the other that you should be able to calculate from your existing
results so even if you can’t run large enough experiments to find this point we expect to see the
answer you have calculated.
Data Statement It is good practice to inform readers where they can find your data and code in
order to check your results and re-run experiments for themselves. This should appear in a Data
statement.
10
We’ve asked you not to include large input files in your gitlab, but you should include any scripts
you created - for instance to generate input files or run experiments, and the“raw” output data
- e.g., a table (e.g., as a .csv file or spreadsheet) of independent variable values matched with
dependent variable values. You may also include this as an appendix in the report.
The data statement should clearly state where all the input data (if included), scripts and output
data can be found.
Hints
• You should be able to do all of the analysis you need to do for this whole lab using a simple
spreadsheet application. However the unix utility gnuplot and the python matplotlib library
also both allow you to create plots from data and fit lines to those plots.
• As the numbers are all small you might want to count n in lots of 10k e.g. for an input of 10,000
say it’s n = 1, for 20,000 say it’s n = 2 etceteras. Alternatively, you may wish to measure time
in milliseconds rather than seconds. This is just to make numbers look more sensible.
• You can use the UNIX time utility to measure running time. This gives you times for real,
user and sys. The most accurate way to judge the running time of your algorithm is to take
the sum or user and sys (real includes the time when other resources might be using your
CPU and can’t account for computation time across multiple cores).
• You may want to plot/calculate average time per insert, rather than plotting the total time
taken to build the dictionary and query it for words. When doing this (particularly for Java) be
aware that there may be a fixed “overhead” time that is independent of the size of dictionary
or query – if this is the case and you are calculating averages you may see that the average time
appears to get better as the dictionary size increases not worse as you would expect. If this is
happening, you may want to start by plotting a graph of total time and determining where this
crosses the origin, in order to give you a guess at overhead, and then deduct this start up value
from the total time before calculating the average. If you do this, make sure to document it in
your report.
• If you are convinced your implementation is correct but your results don’t seem to be as expected,
particularly if you are looking at an overhead that is making time per insert hard to calculate,
you might want to investigate correlations between other factors – e.g., the relationship between
dictionary size and tree height for binary trees to see if that can illuminate what is happening.
Report Extension. As an extension activity you can include a third (set of) experiment(s) in your
report. This experiment can be to explore any aspect of the practical complexity of your implementation of hash sets or binary trees that you wish – good things to look at include: query time; the other
collision resolution strategies for hash sets (if you have implemented them); other hash functions;
and the effect of rehashing on the performance of hash sets, but there are many opportunities here.
Note that you can get 90% of the marks for this coursework without attempting the extension. This
is for students who wish to stretch themselves and should not be attempted unless the rest of the
coursework has been completed in good time.
Instructions for C Solutions
If you intend to provide a solution in C you should wo
Lab Exercise 3: Spellchecking (Better Trade-offs)
Duration: 4 weeks
You should do all your work in the lab3 directory of the COMP26120 2023 repository - see Blackboard
for further details. You will need to make use of the existing code in the branch as a starting point.
Important: You submit this lab via a quiz on Blackboard. This will:
1. Ask you some questions about your implementation, including the hash of the commit you want
us to mark (see below for details of this).
2. Ask you to upload a zip file of the python, c or java folder you have been working in.
3. Ask you to upload a PDF report of the experiments you will run in Part C of this lab.
4. Ask you for a statement on whether and (if relevant) how you used generative AI tools in your
work. If you used generative AI you also have to upload a file containing transcripts of your
usage.
You can save your answers and submit later so we recommend filling in the questions for each part as
you complete it, rather than entering everything at once at the end.
• You MUST fill in and submit the quiz (at least some of it) to get marked.
• Reports MUST be uploaded to Blackboard if you want them marked.
• Code you want marked MUST be in the commit hash you give us in the Blackboard
quiz.
We will not mark anything pushed to GitLab if we don’t have the commit hash we can
identify from Blackboard. Submission time is recorded as the time the Blackboard quiz
is submitted, not the time you last pushed to GitLab or the last time you saved the quiz.
Code Submission Details
You have the choice to complete the lab in C, Java or Python. Program stubs for this exercise exist
for each language. Only one language solution will be marked.
Because people have had a number of issues with GitLab over the years we take a multiple redundancy approach to submission of code. This involves both pushing a commit to GitLab and uploading
a zip of your code to Blackboard. By preference we will mark the code you submitted to GitLab but
1
Figure 1: Identifying the hash of your most recent commit in GitLab
Figure 2: Identifying the hashes of previous commits in GitLab
if we can’t find it or it doesn’t check out properly then we will look in the zip file on Blackboard.
Please do both to maximise the chance that one of them will work.
When you submit the assignment through Blackboard you will be asked for the hash of the commit
you want marked. This is to make sure the TAs can identify exactly which GitLab commit you want
marked.
You can find the hash of your most recent commit by looking in your repository on GitLab as
shown in figure 1.
You can also find the hash for a previous commit by clicking on the “commits” link and then
identifying the commit you are interested in. This is shown in figure 2.
Note that while the full hash for commits are quite long, we only need the first 8 characters (as
shown in the screenshots) to identify for marking.
For this assignment, you may use built-in libraries in your chosen language but bear in
mind that we ask you to implement hash sets - simply using the Java HashSet class (for
instance) will not get you the marks for implementation.
Reminder: It is bad practice to include automatically generated files in source control (e.g. your git
repositories). This applies to object files (C), class files (Java), and compiled bytecode files (Python).
2
It’s not fatal if you do this but it can cause issues during marking so try not to.
While it is fine to discuss this coursework with your friends and compare notes, the work submitted
should be your own. In particular this means you should not have copied any of the source code,
or the report. We will be using the turnitin tool to compare reports for similarities.
Generative AI You may use generative AI for this assignment. If you choose to do so you should
read the guidance notes and watch the guidance video in the Blackboard folder for the lab. At the
start of the quiz you are asked to make a statement about your use of generative AI. If you leave
this blank it will be understood as a declaration that you have not used generative AI
for the coursework. The statement should make the following things clear:
1. Which generative AI tool you used.
2. What parts of the coursework you used it for: if you used it for the code was it for all the code
or only some of it? If just some, which bits? If you used it to help you answer quiz questions,
which quiz questions? if you used it to help with the report, which parts of the report?
3. What steps you took to ensure that the generative AI tool had provided correct information.
4. What evidence there is either in the statement, or in your coursework submission, that demonstrates you have met the coursework learning outcomes.
If you used generative AI you should also upload a zip file containing the transcripts of your usage in
question 2 of the Blackboard quiz.
3
Learning Objectives
By the end of this lab you will be able to:
• Explain the role of the hash function in the implementation of hash tables and describe at least
one hash collision strategy.
• Explain the basic structure of a binary search tree and the basic operations on this structure.
• Produce C, Java, or Python code to implement the above concepts.
• Evaluate experimentally the behaviour of hash sets and binary search trees.
Introduction
The aim of this exercise is to use the context of a simple spell-checking program to explore the hash
table and binary search tree data structures.
This exercise builds on work in Labs 1 and 2. If you have not done these labs we recommend you
take a look at them and their model solutions in order to familiarise yourself with the spell-checking
program and our expectations for creating and writing up experiments.
Getting the Support Code
You should be able to see a lab3 folder in your Gitlab that contains all the support files for this lab.
You can find instructions for accessing your gitlab on Blackboard.
Data Structures
The spell-checking program stores a dictionary of words using a Set datatype. There are two data
structures used to implement this Set datatype in this lab. In Lab 1 we introduced this datatype
but we repeat that information here. You may also need to look online (e.g. the Wikipedia page) to
complete your knowledge.
Set. This is an abstract datatype that is used to store the dictionary of words in our spell-checking
program. The operations required for this application are:
• find whether a given value (i.e. string, alphabetic word) appears in the set.
• insert a given value in the set. Note that there is no notion of multiplicity in sets - a value
either appears or it does not. Therefore, if insert is called with a duplicate value (i.e. it is
already in the set) it can be ignored.
There would usually also be a remove function but this is not required for this application. We also
include two further utility functions, which we ask you to implement, that are useful for printing what
is happening :
• print set to list the contents of a Set.
• print stats to output statistical information to show how well your code is working.
You should have already used a dynamic array to implement this dataype in Lab 1.
In this exercise we use hash tables and binary search trees to implement the Set datatype. These
have been introduced in lectures and we describe these briefly below. You may want to look at the
recommended textbooks or online to complete your knowledge.
4
Hash Table. The underlying memory structure for a hash table is an array. A hash function is
then used to map from a large ‘key’ domain to the (much smaller) set of indices into the array, where
a value can be stored. When two values in the input domain map to the same index in the array this
is called a collision and there are multiple ways to resolve these collisions. To use a hash table to
represent a set we make the key and value the same - the result is usually called a hash set.
Binary Search Tree. You can think of a tree as a linked list where every node has two ‘children’.
This allows us to differentiate between what happens in each sub-tree. In a binary search tree the
general idea is that the ‘left’ sub-tree holds values smaller than that found in the current node, and
the ‘right’ sub-tree holds larger values.
Lab 3: Better Storage
In Lab 1 we achieved a faster find function by first sorting the input. In this part we explore two data
structures that organise the data in a way that should make it faster to perform the find operation.
Part a: Hash Table Lookup
Time Budget: Students generally find Part a of this lab more challenging than Part b. If you have
not attempted Lab 1, then you should budget around 1 hour for familiarising yourself with the code
stubs we provide, the spell-checking program, command line flags, test harness and set data-type
implementation. You might want to look at the sample answers to Lab 1 to assist with this. You
should then budget 2-3 hours for the implementation itself. If you have spent more than 4 hours
on Part a, we recommend skipping ahead to Part b and only returning to Part a if you have time
remaining.
So far our solution to a fast find function has been sorting the input and in Part b we will look at
storing it in a sorted order. In this part you will take a different approach that relies on a hash
function to distribute the values to store into a fixed size array.
In this part you need to edit the program stub in your chosen language for hash sets. You should:
1. Complete the insert function for hash sets. What should you do if the value to be inserted
already exists? Hint: this is representing a set.
2. Complete the find function. Importantly, it should follow the same logic as insertion.
3. Implement the print set function – This should print out a sensible representation of the hash
set when called with the −vvv flag.
4. Implement the print stats function – This should print out a sensible set of statistics. This
should include the total number of collisions (including those incurred when re-inserting after a
rehash), total number of rehashes and the average number of collisions per access. print stats is
automatically called by the spell-checking program on completion. We will look at this output
so do not change or disable it. This will require some extra fields in the class (Java, Python) or
node struct (C).
You can find a discussion of how to test your code in the section of these instructions on Testing (after
Part b).
The hash-value(s) needed for inserting a string into your hash-table should be derived from the
string. For example, you can consider a simple summation key based on ASCII values of the individual characters, or a more sophisticated polynomial hash code, in which different letter positions are
associated with different weights. (Warning: if your algorithm is too simple, you may find
that your program is very slow when using a realistic dictionary and causes issues when
5
you perform your experimental evaluation.)
We want you to implement an open addressing hashing strategy so that collisions are dealt
with by using a collision resolution function. The simplest of these is linear probing, the most
straightforward strategy but you could also choose to implement quadratic probing or double hashing. To understand what is going on you should add code to count the number of collisions
so you can calculate the average per access in print stats. You are welcome to implement whatever
open addressing strategy you like.
The program stubs allow you to implement several collision strategies and hash functions which
can be accessed by modes already given in the stubs. The implementation you want us to mark
should run in mode 0. You may find this useful if you are attempting the extension exercise in the
report. More details of these are contained in the language specific instructions. Do not change these
since they are used in our testing processes. Write your code so that you can use the −m parameter,
which sets the mode.
An issue with open addressing is what to do when it is not possible to insert a value. To make things
simple to begin with you can simply fail in such cases. Your code should keep the num entries field
of the table up to date. Your insert function should check for a full table, and exit the program
cleanly should this occur. Once this is working you should increase (double?) the hash table size
when the table is getting full and then rehash into a larger table. Implement this resize and
rehash functionality. When the program is called using the −vvv flag it should print out
a message when it rehashes and then call the print set function to show the new state of
the hash set once rehashing completes. In C, beware of memory leaks!1
Mod of negative numbers:
Most of you are likely to use the mod operator in your hash functions by doing something
like
h = x % size
to try to get a value of h that is between 0 and size - 1.
If x is negative then the behaviour of this operator varies by language: in particular in C and Java it will return a negative number. See the discussion at
https://torstencurdt.com/tech/posts/modulo-of-negative-numbers/
You might think that x cannot be negative. However, if you are adding or multiplying large
numbers together (e.g., when hashing a string) then you are likely to get integer overflow.
In hashing, we don’t care about such overflows as it is deterministic and just adds to the
‘chaos’ a bit but Java, for instance, may throw ArrayIndexOutOfBoundsException if a
negative number is passed to the hash data structure, if you have not accounted for this
possibility in your implementation.
In C you could use an unsigned integer to make sure that you only have positive numbers.
Once you have completed this implementation you should look at the Blackboard submission form
where you will be asked the following questions about Part a.
1. What did you implement as your hash function? How does it work? How “good” is it? (No
more than 100 words).
1One can also get memory leaks in memory managed languages but these are more semantic e.g. preserving a
reference to something you will never use again.
6
2. How do find and insert for hash sets work? How have you implemented them? You do not need
to discuss the details of your collision resolution strategy here (this is the next question) but
should cover the basics of how a word is inserted into the dictionary and how a word is found
(or not) in the dictionary. (No more than 100 words, not counting any code snippets included).
3. What is your collision resolution strategy? How does it work? How have you implemented
it? Illustrate your explanation with a sample of your code - this may be tided up a bit for
presentation but should clearly be the same as the code you have submitted (No more than 200
words, not counting the code snippet).
4. Did you implement rehashing? If so briefly explain your implementation including explaining
when you choose to rehash. (No more than 200 words, not counting any code snippet included).
Part b: Binary Search Tree Lookup
Time Budget: This part of the lab should take you about 2 hours to complete. If it is taking you
more than 3 hours and you have completed a basic hash set implementation for Part a you might
want to skip forward and do the first section of Part c (which only needs the hash set implementation)
before returning to this. If you have not completed any of Part a it is worth persevering with this to
get it working before moving on to Part c where you should focus on those sections relating to binary
trees.
In this part you need to edit the program stub in your chosen language for binary search trees.
You should:
1. Complete the insert function by comparing the value to insert with the existing value.
2. Complete the find function. Importantly, it should follow the same logic as insertion.
3. Implement the print set function – This should print out a sensible representation of the hash
set when called with the −vvv flag.
4. Implement the print stats function – This should print out a sensible set of statistics, including
the height of the tree and average number of comparisons per insert and find. print stats is
automatically called by the spell-checking program on completion. We will look at this output
so do not change or disable it.
Once you have completed this implementation you should look at the Blackboard submission form
where you will be asked the following question about Part b.
1. How do find and insert for binary trees work? How have you implemented them? Include your
code for insert and find and relate your explanation to this code. (No more than 200 words, not
including the code snippet(s))
In this lab exercise we will stop there with trees. However, it is worth noting that this implementation
is not optimal. What will happen if the input dictionary is already sorted? In the next lab we will be
exploring self-balancing trees.
Testing
There are instructions in the individual language sections explaining how to test your algorithm on a
single example and you should try that first. Once that is done you can test them on some test suites.
We have provided a test script, test.py in the top level directory of the COMP26120 2023 repository and some data in the data directory. test.py takes up to five arguments: the first is the language
you are using; the second is the test suite (simple, large and collision tests are provided but
you can create more); the third is the data structure (bstree or hashset) you are testing; the fourth
7
(optional, defaults to 0) is the mode you want to test; and the fifth (optional) is the initial hash set
size (only applicable for the hashset).
For example, to run the simple tests for mode 2, your language is Java and you want to test your
binary tree implementation, you should run
./python3 test.py java simple bstree 2
You might want to edit the script to do different things but we will use the one provided when marking
your code so make sure it runs with this. We have also provided a large test (replace simple by
large above) which will take a lot longer to run (see help box below). You will need to unzip the
files by running unzip henry.zip in the data/large directory. This should be particularly useful to
check if you have implemented rehashing correctly.
Finally we have provided some collision test tests. These are dictionaries containing exactly
five items. If you run them with an initial hash set size of 5 it is highly likely that at least one insert
will result in a collision and this will allow you to check if your collision resolution mechanism is
working correctly. These can also be used to test if your rehashing implementation is working.
For example, to run the collision tests for mode 0 with python as the language, you should run
./python3 test.py python collision_tests hashset 0 5
This will run your hash set implementation on these tests with an initial hash set size of 5.
In general, to be sure that you collision resolution strategy is working, we would recommend adding
a message that gets printed out when run in verbose mode (e.g., with the -vv flag) when a collision
occurs, possibly including printing out the set after the element is inserted, and then running tests
manually using the instructions in the language specific sections to make sure at least one of your
tests is encountering a collision and it is handled correctly when it does.
Failing Tests:
If the tests are failing there are two possible explanations. Either your implementation is
wrong or the test script is being more picky than you thought.
For the first reason, make sure you understand what you expect to happen. Create a
small dictionary and input file yourself and test using these or use one of the individual
simple tests. There are nine of these and you will find them organised in the data/simple
directory where each sub-directory contains an example dictionary, input file and the
answer expected by the test. Most of these are small and easy to understand. Remember
that the program should print out all of the words that are in the input file but not in the
dictionary. Make sure that your code is connected up properly in find so that it returns
true/false correctly.
For the second reason, make sure that you don’t print anything extra at verbosity level 0. If
you look at test.py you’ll see that what it’s doing is comparing the output of your program
between “Spellchecking:” and “Usage statistics:” against some expected output. It runs the
program at verbosity level 0 and expects that only spelling errors are output in-between
those two points.
As part of marking, we will run your implementations on the simple and collision tests test
suites and may also run them on the large test suite if we feel additional testing is required.
Part c: Making a Comparison
Time Budget: If you have not attempted Lab 2 then you should budget 1 hour to familiarise
yourself with the concept of running experiments, generating data and so on. You should then expect
8
this part of the lab should take you 2-3 hours to complete. However, actually generating data and
running experiments is likely to take more time than that, but you can leave your computer running
and do something else while this is happening – particularly if you write scripts to automate data
generation and run experiments. You should plan your time with this in mind. We recommend writing
up the experiments as you go so you have something to submit if you run out of time.
You should now perform an experimental evaluation like the one you did for Lab 2. We want to
address the question.
Under what conditions are your different implementations of the dictionary data structure
preferable?
Note that for hash sets the initial size of the data structure will potentially play a role. You may
want to design your experiment so this is not a concern. You may also find it useful to correlate your
findings with statistics such as the number of collisions in the hash table.
We recommend you read, if you have not already done so, the Lab 2 instructions on how to generate
inputs for your experiments. You may also (if possible) reuse the inputs you generated as part of that
lab as part of your evaluation here in order to save time. Note that generation of inputs is likely to
take some time (hours for large dictionaries and queries), so you should plan your work so you can
set a generation script running and then leave it while you do something else.
You should perform two experiments for your evaluation and then draw a conclusion that answers
the question above. You can then devise and run a third optional (set of) experiments as an extension
exercise.
The two experiments you should perform should answer the questions:
1. Does your implementation of insert for hash sets work as expected in the average case in terms
of complexity.
2. Does your implementation of insert for binary trees work as expected in the average case in
terms of complexity.
Important: In your hypothesis, experimental design and results you should be very careful to
consider whether you are discussing or measuring a single insertion, or several insertions and you
should be consistent around this throughout. Of course, you can measure several insertions and from
that calculate the cost of a single insertion, but you should be clear about this.
Your presentation of each experiment should have four sections – these are likely to be very similar
for each experiment:
Hypothesis You should state, as a hypothesis, what you expect the performance to be in big O
notation and then write a short paragraph explaining why you believe this to be the case, based
on the theoretical complexities of insert for the data structure. Be careful about how you present
big O notation. In particular O is not a function so it doesn’t make sense to perform arithmetic
with it such as
O(n)
n
or
O(n) × n
(This section should be about 200 words and take up no more than half a page in your report).
Experimental Design You should describe how you designed the experiment. This should include:
• Your intended independent variable – this is the thing you are varying.
• Your intended dependent variable – this is the thing you are measuring whose value your
hypothesis says depends upon the value of the independent variable.
9
• Anything that you could vary but you are not – for instance to avoid confusing your results
with other variables that might also influence performance.
• What input data you generated and how many input files you used for each value of your
independent variable.
• What program you ran on what inputs. How many times you ran it on each input.
• What you measured and how.
You should also mention anything else you think is relevant that will help your marker judge
how well you designed your experiment to test your hypothesis. (This section should be about
500 words and take up no more than a page in your report – not counting any scripts included
as appendices).
Results You should present your results, ideally as a graph showing a best fit line. If you do any
processing on results, such as generating best fit lines, computing averages, etc., then these
should be described. Raw data should be presented in an appendix or included with your code
submission. You should state clearly whether your hypothesis was confirmed or refuted (or it is
equivocal or difficult to tell). If your hypothesis was refuted you should discuss why that might
be (e.g., incorrect implementation of insert, some problem with the experimental design) and
sketch how you might investigate further. (This section should be about 200 words and take up
no more than half a page in your report – not counting graphs and tables).
Note that it is more important to be able to clearly explain your experimental design (justifying
your decisions) than it is to have excellent results. The answers to the wrong question are useless.
Top marks can be obtained for Part c for a well-designed experiment that has disappointing results
(either disproving a hypothesis or just being unclear in what they show), so long as the conclusions
you draw make it clear you are aware of the issues with the results.
Disappointing/Surprising Results:
While you can get full marks for results that disprove your hypothesis, these often are
indicative of issues with the implementation or, if you are using them, with any scripts
written to automate the experiments. Therefore it is alway worthwhile double-checking these
things since problems with these will effect your marks. The commonest implementation
issue we see here is that find and/or insert are implemented in a way that returns the correct
answer but does more work than necessary. The commonest issue with automation scripts
is that they are not running the program at all, trying to run it on non-existent or empty
input files, or causing it to crash somehow. If your results seem to show that your program
takes the same amount of time on all inputs (particularly if that time is very short) then
it is worth double-checking that your automation script is actually running the program
correctly.
Once you have written up your experiments you should write two further sections.
Conclusion Use your results to address the initial question: when should you use your hash set
implementation and when should you use your binary tree implementation? You do not need to
validate this experimentally since for some implementations this is likely to take a lot of time,
even with a good implementation on a fast machine. Theoretically, there is a crossover point
where one becomes better than the other that you should be able to calculate from your existing
results so even if you can’t run large enough experiments to find this point we expect to see the
answer you have calculated.
Data Statement It is good practice to inform readers where they can find your data and code in
order to check your results and re-run experiments for themselves. This should appear in a Data
statement.
10
We’ve asked you not to include large input files in your gitlab, but you should include any scripts
you created - for instance to generate input files or run experiments, and the“raw” output data
- e.g., a table (e.g., as a .csv file or spreadsheet) of independent variable values matched with
dependent variable values. You may also include this as an appendix in the report.
The data statement should clearly state where all the input data (if included), scripts and output
data can be found.
Hints
• You should be able to do all of the analysis you need to do for this whole lab using a simple
spreadsheet application. However the unix utility gnuplot and the python matplotlib library
also both allow you to create plots from data and fit lines to those plots.
• As the numbers are all small you might want to count n in lots of 10k e.g. for an input of 10,000
say it’s n = 1, for 20,000 say it’s n = 2 etceteras. Alternatively, you may wish to measure time
in milliseconds rather than seconds. This is just to make numbers look more sensible.
• You can use the UNIX time utility to measure running time. This gives you times for real,
user and sys. The most accurate way to judge the running time of your algorithm is to take
the sum or user and sys (real includes the time when other resources might be using your
CPU and can’t account for computation time across multiple cores).
• You may want to plot/calculate average time per insert, rather than plotting the total time
taken to build the dictionary and query it for words. When doing this (particularly for Java) be
aware that there may be a fixed “overhead” time that is independent of the size of dictionary
or query – if this is the case and you are calculating averages you may see that the average time
appears to get better as the dictionary size increases not worse as you would expect. If this is
happening, you may want to start by plotting a graph of total time and determining where this
crosses the origin, in order to give you a guess at overhead, and then deduct this start up value
from the total time before calculating the average. If you do this, make sure to document it in
your report.
• If you are convinced your implementation is correct but your results don’t seem to be as expected,
particularly if you are looking at an overhead that is making time per insert hard to calculate,
you might want to investigate correlations between other factors – e.g., the relationship between
dictionary size and tree height for binary trees to see if that can illuminate what is happening.
Report Extension. As an extension activity you can include a third (set of) experiment(s) in your
report. This experiment can be to explore any aspect of the practical complexity of your implementation of hash sets or binary trees that you wish – good things to look at include: query time; the other
collision resolution strategies for hash sets (if you have implemented them); other hash functions;
and the effect of rehashing on the performance of hash sets, but there are many opportunities here.
Note that you can get 90% of the marks for this coursework without attempting the extension. This
is for students who wish to stretch themselves and should not be attempted unless the rest of the
coursework has been completed in good time.
Instructions for C Solutions
If you intend to provide a solution in C you should wo