代写COSC 2123/1285 Algorithms and Analysis
- 首页 >> CSCOSC 2123/1285 Algorithms and Analysis
Assignment 1: Word Completion
Assessment Type (Group of 1 or 2) Assignment. Submit online via Canvas → Assignments → Assignment 1. Clarifications/Updates/FAQ: check
Canvas → Discussion → Assignment 1 Discussion.
Due Date Week 7, 11:59pm, Friday, September 08, 2023
Marks 30
1 Objectives
There are a number of key objectives for this assignment:
• Understand how a real-world problem can be implemented by different data structures and/or
algorithms.
• Evaluate and contrast the performance of the data structures and/or algorithms with respect to
different usage scenarios and input data.
In this assignment, we focus on the word completion problem.
2 Background
Word/sentence (auto-)completion is a very handy feature of nowadays text editors and email browsers
(you must have used it in your Outlook). While sentence completion is a much more challenging task
and perhaps requires advanced learning methods, word completion is much easier to do as long as
you have a dictionary available in the memory. In this assignment, we will focus on implementing a
dictionary comprising of words and their frequencies that allows word completion. We will try several
data structures and compare their performances, which are array (that is, Python list), linked list,
and trie, which are described one by one below. Please read them very carefully. Latest updates and
answers for questions regarding this assignment will be posted on the Discussion Forum. It is your
responsibility to check the post frequently for important updates.
Array-Based Dictionary
Python’s built-in ‘list’ is equivalent to ‘array’ in other language. In fact, it is a dynamic array in the
sense that its resizing operation (when more elements are inserted into the array than the original
size) is managed automatically by Python. You can initialize an empty array in Python, add elements
at the end of the array, remove the first instant of a given value by typing the following commands
(e.g., on Python’s IDLE Shell).
>>> array = []
>>> array.append(5)
>>> array.append(10)
>>> array.append(5)
>>> array
[5, 10, 5]
>>> array.remove(5)
>>> array
[10, 5]
In the array-based dictionary implementation, we use the Python list (a data structure) to implement common operations for a dictionary (an abstract data type). We treat each element of the
array as a pair (word, frequency) (defined as an object of the simple class WordFrequency), where
word is an English word (a string), e.g., ‘ant’, and frequency is its usage frequency (a non-negative
integer), e.g., 1000, in a certain context, e.g., in some forums or social networks.
The array must be sorted in the alphabetical order, i.e., ‘ant’ appears before ‘ape’, to facilitate
search. A new word, when added to the dictionary, should be inserted at a correct location that
preserves the alphabetical order (using the module bisect is allowed - but you need to know what
it does). An example of a valid array is [(‘ant’, 1000), (‘ape’, 200), (‘auxiliary’, 2000)].
Adding (‘app’, 500) to the array, we will have [(‘ant’, 1000), (‘ape’, 200), (‘app’, 500),
(‘auxiliary’, 2000)]. Note that the pair (‘ant’, 1000) in our actual implementation should be
an object of the class WordFrequency and not a tuple.
A Search for ‘ape’ from the array above should return its frequency 200. If the word doesn’t exist,
0 is returned.
A Delete for a word in the dictionary should return True and remove the word from the dictionary
if it exists, and return False otherwise.
An Autocompletion for a string should return a sorted list (most frequent words appear first) of
the three words (if any) of highest frequencies in the dictionary that have the given string as a prefix.
For the array above, an autocompletion for ‘ap’ should return the list [(‘app’, 500),(‘ape’, 200)].
Notice that while both ‘app’ and ‘ape’ have ‘ap’ as a prefix, ‘app’ has a larger frequency and appears
first in the returned list of autocompletion.
Linked-List-Based Dictionary
A linked list is precisely what it is called: a list of nodes linked together by references. In a singly
linked list, each node consists of a data item, e.g., a string or a number, and a reference that holds the
memory location of the next node in the list (the reference in the last node is set to Null). Each linked
list has a head, which is the reference holding memory location of the first node in the list. Once we
know the head of the list, we can access all nodes sequentially by going from one node to the next
using references until reaching the last node.
In the linked-list-based implementation of dictionary, we use an unsorted singly linked list. You
can use the implementation of the linked list in the workshop as a reference for your implementation.
Each node stores as data a pair of (word, frequency) (an object of the class WordFrequency) and
a reference to the next node. A word and its frequency are added as a new node at the front of the
list by updating the head reference. Apart from the fact that they are carried out in the linked list,
Search, Delete, and Autocomplete work similarly as in the array-based implementation. Note that
unlike the array-based dictionary, the words in the linked list are not sorted.
2
Trie-Based Dictionary
Trie (pronounced as either ‘tree’ or ‘try’) is a data structure storing (key, value) pairs where keys
are strings that allows fast operations such as spell checking and auto-completion. Introduced in the
context of computer decades ago, it is no longer the most efficient data structure around. However,
our purpose is to focus more on the understanding of what data structures mean, how they can be
used to implement an abstract data type, and to empirically evaluate their performance. Thus, we
stick to the original/simple idea of ‘trie’. You are strongly encouraged to read about more advanced
data structures evolving from trie.
Each node of a trie contains the following fields:
• a lower-case letter from the English alphabet (‘a’ to ‘z’), or Null if it is the root,
• a boolean variable that is True if this letter is the last letter of a word in the dictionary and False
otherwise,
• a positive integer indicating the word’s frequency (according to some dataset) if the letter is the
last letter of a word in the dictionary,
• an array of A = 26 elements (could be Null) storing references pointing to the children nodes.
In our implementation, for convenience, we use a hashtable/Python’s dictionary to store the
children.
As an example, consider Figure 1.
Figure 1: An example of a trie storing six words and their frequencies. The boolean value True
indicates that the letter is the end of a word. In that case, a frequency (an integer) is shown, e.g., 10
for ‘cut’. Note that a word can be a prefix of another, e.g., ‘cut’ is a prefix of ‘cute’.
Construction. A trie can be built by simply adding words to the tree one by one (order of words
being added is not important). If a new word is the prefix of an existing word in the trie then we can
3
simply change the boolean field of the node storing the last letter to True and update its frequency.
For instance, in the example in Figure 1, if (‘cut’, 10) is added after (‘cute’, 50), then one can simply
change the boolean field of the node containing the letter ‘t’ to True and set the frequency to 10,
signifying that now (‘cut’, 10) is part of the tree. In another case, when (‘cup’, 30) is added, a new
node has to be constructed to store ‘p’, which has a True in the boolean field and 30 as its frequency.
In this case, ‘cup’ can reuse the two nodes storing ‘c’ and ‘u’ from the previous words.
Searching. To search for a word (and to get its frequency) in a trie, we use its letters to navigate
the tree by following the corresponding child node. The search is successful if we can reach a node
storing the last letter in the word and has the boolean field True. In that case, the frequency stored
in this node is returned. The search fails, that is, the word is not in the tree, if either a) the search
algorithm couldn’t find a child node that matches a letter in the word, or b) it finds all the nodes
matching all the letters of the words but the boolean field of the node corresponding to the last letter
is False.
Deletion. The deletion succeeds if the word is already included in the tree. If the word is a prefix
of another word then it can be deleted by simply setting the boolean field of the node storing the last
letter to False. For example, if (cut, 10) is to be deleted from the trie in Figure 1, then we only need
to change the boolean field of the node storing ‘t’ to False. Otherwise, if the word has a unique suffix,
then (only) nodes corresponding to the suffix are to be deleted. For example, if (cup, 30) is to be
removed, then the node storing the last letter ‘p’ must be deleted from the trie to save space but not
‘c’ and ‘u’ because these still form part of other words.
Auto-completion. Auto-completion returns a list of three words (if any) in the dictionary (trie)
of highest frequencies that have the given string as a prefix. For example, in the trie given in Figure 1,
• the auto-completion list for ‘cu’ is: [(cute, 50), (cup, 30), (cut, 10)],
• the auto-completion list for ‘far’ is: [(farm, 40)].
Suppose we add one more word (curiosity, 60) into this tree, then the auto-completion list of ‘cu’ will
be changed to [[(curiosity, 60), (cute, 50), (cup, 30)]. In this example, although ‘cut’ contains ‘cu’ as
a prefix, it is not in the top three of the most common words having ‘cu’ as a prefix. In general, the
auto-completion list contains either three, two, one, or no words. They must be sorted in decreasing
frequencies, that is, the first word has the highest frequency and the last has the lowest frequency.
3 Tasks
The assignment is broken up into a number of tasks, to help you progressively complete the assignment.
3.1 Task A: Implement the Dictionary and Its Operations Using Array, Linked
List, and Trie (12 marks)
In this task, you will implement a dictionary of English words that allows Add, Search, Delete,
and Auto-completion, using three different data structures: Array (Python’s list), Linked List, and
Trie. Each implementation should support the following operations:
• Build a dictionary from a list of words and frequencies: create a dictionary that stores words
and frequencies taken from a given list. This operation is not tested.
• (A)dd a word and its frequency to the dictionary. Return True if successful or False if it already
exists in the dictionary.
• (S)earch for a word in a dictionary and return its frequency. Return 0 if not found.
4
• (D)elete a word from the dictionary. Returns True if successful and False if it doesn’t exist in
the dictionary.
• (AC)Auto-complete a given string and return a list of three most frequent words (if any) in the
dictionary that have the string as a prefix. The list can be empty.
3.1.1 Implementation Details
Array-Based Dictionary. In this subtask, you will implement the dictionary using Python’s lists,
which are equivalent to arrays in other programming languages. In this implementation, all standard
operations on lists are allowed. Other data structures should NOT be used directly in the main
operations of the array-based dictionary. The usage of the module ‘bisect’ is allowed. Other data
structures should NOT be used directly in the main operations of the array-based dictionary. See the
Background Section for more details and an example.
Linked-List-Based Dictionary. In this subtask, you will implement the dictionary by using a
singly linked list. Other data structures should NOT be used directly in the main operations of the
array-based dictionary (but Python’s list can be used to store intermediate data or the input/output).
See the Background Section for more details and an example.
Trie-Based Dictionary. In this subtask, you will implement the dictionary using the trie data
structure. Both iterative or recursive implementations are acceptable. See the Background Section
for more details and an example.
3.1.2 Operations Details
Operations to perform on the implementations are specified on the command file. They are in the
following format:
where operation is one of {S, A, D, AC} and arguments is for optional arguments of some of the
operations. The operations take the following form:
• S word – searches for word in the dictionary and returns its frequency (returns 0 if not found).
• A word frequency – adds a new word and its frequency to the dictionary, returns True if
succeeded and False if the word already exists in the dictionary.
• D word – deletes word from the dictionary. If fails to delete (word is not in the dictionary),
returns False. Unneeded nodes (after deletion) must be removed.
• AC partial word – returns a list of three words of highest frequencies in the dictionary that has
partial word as a prefix. These words should be listed in a decreasing order of frequencies.
Maximum three words and minimum zero word will be returned.
As an example of the operations, consider the input and output from the provided testing files,
e.g., sampleDataToy.txt, testToy.in, and the expected output, testToy.exp (Table 1).
Note, you do NOT have to do the input and output reading yourself. The provided Python files will
handle the necessary input and output formats. Your task is to implement the missing methods in
the provided classes.
5
sampleDataToy.txt testToy.in (commands) testToy.exp (expected output)
cute 10 S cute Found ‘cute’ with frequency 10
ant 20 D cute Delete ‘cute’ succeeded
cut 30 S cute NOT Found ‘cute’
cuts 50 S book NOT Found ‘book’
apple 300 A book 10000 Add ’book’ succeeded
cub 15 S book Found ‘book’ with frequency 10000
courage 1000 S apple Found ‘apple’ with frequency 300
annotation 5 D apple Delete ‘apple’ succeeded
further 40 S apple NOT Found ‘apple’
furniture 500 D apple Delete ‘apple’ failed
find 400 AC c Autocomplete for ‘c’: [ calm: 1000 cuts: 50 cut: 30 ]
farm 5000 AC cut Autocomplete for ‘cut’: [ cuts: 50 cut: 30 ]
farming 1000 D cut Delete ‘cut’ succeeded
farmer 300 AC cut Autocomplete for ‘cut’: [ cuts: 50 ]
appendix 10 AC farms Autocomplete for ‘farms’: [ ]
apology 600
apologetic 1000
fur 10
fathom 40
apps 60
Table 1: The file sampleDataToy.txt provides the list of words and frequencies for the dictionary,
while testToy.in and testToy.exp have the list of input commands and expected output. For
instance, as ‘cute’ belongs to the dictionary, the expected outcome of “S cute” should be “Found
‘cute’ with frequency 10” and “D cute” should be successful. After deleting ‘cute’, “S cute”
should return “NOT Found ‘cute’”. Note that although there are more than three words having ‘c’
as a prefix, “AC c” only returns the three words of highest frequencies. Also, “AC cut” returns “[
cuts: 50 cut: 30 ]”, but after deleting ‘cut’, it must return “[ cuts: 50 ]” only.
3.1.3 Testing Framework
We provide Python skeleton codes (see Table 2) to help you get started and automate the correctness
testing. You may add your own Python modules to your final submission, but please ensure that they
work with the supplied modules and the Python test script.
Debugging. To run the code, from the directory where dictionary file based.py is, execute
(use ‘python3’ on Linux, ‘python’ on Pycharm):
> python3 dictionary_file_based.py