program编程设计辅导、辅导Python程序语言、Python设计编程调试
- 首页 >> Matlab编程 Untitled 1
Purpose
The purpose of this assignment is to give you practise working with files, building and using
dictionaries, designing functions, and writing unit tests.
Introduction: The Twitterverse
Twitter is a social networking website where users can post very short messages know as
"tweets". Each Twitter user can choose to "follow" other users, which means that they see
those users' tweets. A Twitter user sees the tweets of users they are "following", and their
tweets are seen by their "followers" (the users who follow them).
All the "follow" connections define a network among Twitter users, and it's quite interesting to
look for patterns in the connections. Tools like Twiangulate let you explore questions like
"what connections do my two friends have in common?". In this assignment, you'll write a
program that lets you ask questions (or "queries") about a Twitter dataset.
Any tool for exploring the Twitterverse must get its data from Twitter itself. Twitter
provides an API to allow programmers to write programs that interact with Twitter and extract
data from it. In general, an API is a module that defines functions for accessing underlying
data and performing other tasks without having to know how that data is actually stored and
retrieved.
To make this assignment more manageable for you, we will assume that the information we
need has already been extracted from Twitter and stored in a file.
How To Tackle This Assignment
This is your first experience designing a program of this size. We are providing detailed
advice to help you break the task down into manageable pieces.
Make sure your twitterverse_functions.py module runs without error before submitting. If you
have syntax errors in your module, comment them out before submitting, or we will not be
able to test your functions! Submitted programs that have syntax errors, do not
execute/run, or infinitely loop are subject to a grade of 0.
The Twitter Data File
A Twitter data file contains a series of one or more user profiles, one after the other. Each user
profile has the following elements, in this order:
A line containing a non-blank, non-empty username. You may assume that usernames are
unique, that is, a single username will not occur more than once in the file, and that
usernames do not contain any whitespace.
A line for the user's actual name. If they did not provide a name, this line will be blank.
A line for the user's location, or a blank line if they did not provide one.
A line for the URL of a website, or a blank line if they did not provide one.
Untitled 2
Zero or more lines for the user's bio, then a line with nothing but the keyword ENDBIO on it.
This marks the end of the bio, and is not considered part of it. You may assume that no
bio has the string ENDBIO within it.) If the user did not provide a bio, the ENDBIO line will
come immediately after the website line, with no blank line in between.
Zero or more lines each containing the username of someone that this user is following,
then a line with the keyword END on it. You may assume that no one has END as their
username.) A user cannot be on his or her own following list. You may assume that every
user on a following list has a user profile in the Twitter data file.
Notice that the keywords act as separators in this file. All of their letters are capitalized, and
the keywords contain no punctuation.
Examples
Here is a sample user profile that might occur among many in a file:
tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
The file data.txt is a smallish example of a complete Twitter data file (and was made by hand)
and the file rdata.txt is a much larger example (and is made from real data extracted from
Twitter). These should help you confirm your understanding of the file format and will also be
useful in testing your program.
Cycles in the data
Although a user cannot be on their own following or followers lists, there can be "loops" (we
call them "cycles") such as this: user A can be following B who is following A. This is the
shortest possible cycle. Of course, cycles can be longer.
The Query File
Note that the word "query" just means "question". In computer science, we use it to mean a
request for information. For this assignment, a query will be provided in a file. Below we will
review the high level parts of the query, look at an example, and then describe the format of
the query file.
Overview
A query has three components: a search specification, a filter specification, and a presentation
specification.
The search specification describes how to generate a list of Twitter usernames, starting with
an initial username (a list of length one) and then finding their followers or people they are
Untitled 3
following, then people that are those people's followers or who they are following, and so on.
When processing the search specification, don't try to do anything to avoid cycles. For
instance, if the search specification says to find the people who user A is following, and from
there the people they are following, you could find yourself back at user A. Don't try to avoid
that.
After processing the search specification, we have a list of Twitter usernames. Its length could
be zero. For example, if the initial username is 'adalovelace' and the search specification
contains a single 'followers' keyword, then the length of the list will be zero
if 'adalovelace' has no followers.
The filter specification describes how to filter the list of usernames produced by the search
specification. The filtering can be based on
whether or not they are following a particular user,
whether or not a particular user is their follower,
whether their name contains a particular string (case-insensitive), or
whether their location contains a particular string (case-insensitive).
After processing the filter specification, we have a possibly reduced list of usernames.
Once the search results have been found and filtered, the presentation
specification describes how the output should be presented. It specifies on what basis the
results should be sorted, and whether the results should be presented in a short or long
format.
Example query
Here is an example query:
SEARCH
tomCruise
following
following
following
FILTER
following KatieH
location-includes USA
PRESENT
sort-by popularity
format long
The search specification in this particular query has four steps.
Start with a list containing the username to start the search from; ie. ['tomCruise'] . Let's
call that list L1.
The search keyword 'following' says to replace each username p in L1 with the usernames
of the users who p is following. This yields a new list, L2.
For the next 'following' keyword, we start with L2 and repeat the same operation as in
the previous step, yielding another list, L3.
Untitled 4
For the final 'following' keyword, we start with L3 and repeat that operation one last time,
yielding list L4.
Notice that each step yields a list of zero or more usernames that is the input to the next step.
There should be no duplicates in the final results list. Duplicates should be removed after each
step.
The Twitter data file diagram_data.txt contains the follower/following relationships as
represented by this diagram. For those relationships, the search specification above would
yield this list of usernames: ['i', 'j', 'h', 'k', 'tomCruise'] . Make sure that you can see how
the four lists, ending with this final one, are generated. Notice that the final list contains the
users you can get to in three "steps" of the "following" relationship, starting from 'tomCruise' .
The final list generated by the search specification becomes the input to the filter
specification. For our current example, the filter specification says that the list should be
filtered in this way: a user should be kept only if they are following 'KatieH' and has a location
that includes the string 'USA' . The presentation specification says to present the results in
long format and to order the users according to their popularity.
Now let's look in detail at the exact format and meaning of a query.
Overall format of a query
The format of the query data file is as described below, in this order:
The keyword SEARCH alone on a line.
The search specification.
The keyword FILTER alone on a line.
The filter specification.
The keyword PRESENT alone on a line.
The presentation specification.
Notice that the keywords above all act as separators in this file and are in all capital letters.
There are other keywords (such as following which can be part of a search specification) that
are not separators; they are not capitalized.
Format and meaning of the search specification
The format of a search specification is as shown below, in this order:
A single username
Zero or more lines, in any order, each containing one of:
the keyword followers
the keyword following
Each step in the search specification involves modifying the current list of users by going
through each user in the current list and finding their followers or who they are following.
Importantly, at each step, we want to create the new results list by replacing each user from
the current list with the users that they are following or who are their followers.
Untitled 5
The possible search operations for a given user are:
followers operation: a list of all usernames for users who are following the given user (ie.
that have the given username in the value list paired with the key 'following' in their data
in the Twitterverse dictionary).
following operation: a list of all usernames for users who the given user is following (ie.
the value list paired with the key 'following' in the data associated with the given user in
the Twitterverse dictionary).
Note that, if the list of usernames is ['A'] , and we perform
a 'followers' or 'following' operation, 'A' will not be in the resulting list, since users cannot
follow themselves. If 'A' is one of multiple usernames in the current list, then 'A' can appear
in the results list, but only if one of the other usernames produces 'A' from the search
operation.
The final list of results should not contain any duplicates. It is most efficient to remove
duplicates after each operation is performed.
Format and meaning of the filter specification
The format of a filter specification is:
Zero or more lines, in any order, each containing one of the following:
the keyword name-includes , a space, and a string to match
the keyword location-includes , a space, and a string to match
the keyword follower , a space, and a string which is a valid username
the keyword following , a space, and a string which is a valid username
Each of the lines has exactly one blank within it (the space separating the keyword from the
next string), and each keyword will appear at most once.
The filters are "additive" in that each filter step should be applied to the list that resulted from
the previous filter step. That is, you start with the list from the search specification, filter it
based on the first filter step to get a new list L1, filter L1 in the second filter step to get L2, and
so on.
The 'following' and 'follower' filters will keep users in the list if they are following a
particular user, or if a particular user is their follower. These filters are defined as:
The 'following' filter keeps only users who have the provided username in
their 'following' list.
The 'follower' filter keeps only users who appear in the 'following' list for the provided
username.
The filters that have 'includes' as part of the keyword will do a simple substring search on
the string representing the relevant data for the user in the Twitterverse dictionary. This means
that if the given string occurs anywhere in a user's name (for a 'name-includes filter' ) or a
user's location (for a 'location-includes' filter), then that user is to be kept in the list. The
substring search should be case-insensitive. For example, if the filter specifies users whose
locations include the string "USA", then users with location "USA", "Modesto, California, USA",
Untitled 6
"kansas, usa" or "Musala, Bulgaria" would be kept, and users with location "United States of
America" would be excluded. This is far from perfect, but don't try to improve on it.
Format and meaning of the presentation specification
The output from your program must include every user whose username is still in the list after
the search and filter specifications have been processed.
The presentation specification describes how the output should be presented. It consists of
these two lines, in order:
The keyword 'sort-by' , then a space, then one of these
keywords: 'username' , 'name' , 'popularity'
Either 'format short' or 'format long' .
The users must be listed in the order indicated by the 'sort-by' portion of the presentation
specification. Output that is sorted by username or name should be in alphabetical order, so
that strings starting with 'a' are at the beginning of the output. Although usernames are
unique, names are not. If any users have the same name, sort them by username. When sorting
users by popularity, the most popular users should come first. A user's popularity is defined to
be the number of followers they have. Not the number of people they are following! If any
users are tied for popularity, sort them by username.
If the presentation specification is for short format output, the output should consist of the
usernames as a Python list of strings. If the presentation specification is for long format
output, the output for each user should consist of a series of lines about the user, as shown in
the example below. The long format output must include a line of ten dashes between each
pair of users, as well as a line of dashes at the very top and very bottom of the output.
If there are no users to be presented (i.e., the 2nd parameter in get_present_string represents
an empty list), then the long format output should just include two lines of dashes.
For the long format, the series of lines describing a single user should be formatted as shown
below:
tomCruise
name: Tom Cruise
location: Los Angeles, CA
website: http://www.tomcruise.com
bio:
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
following: ['katieH', 'NicoleKidman']
Notice that:
The name, location, and website text is reproduced exactly as it was in the Twitter data
file, with the addition of the labels at the start of the line.
The bio begins on the line after the line containing 'bio:' , unlike the other components
that appear on the same line as their "tag".
The bio itself is also reproduced exactly as it was in the Twitter data file, including upper
and lower case, newline characters, etc.
Untitled 7
The "following" list is produced simply by converting the list of username strings to a
string.
Full Example
If the user were to enter the name of the Twitter data file data.txt and the query query3.txt,
the entire interaction with the Twitterverse program should be:
Data file: data.txt
Query file: query3.txt
----------
PerezHilton
name: Perez Hilton
location: Hollywood, California
website: http://www.PerezH...
bio:
Perez Hilton is the creator and writer of one of the most famous websites
in the world. And he also loves music - a lot!
following: ['tomCruise', 'katieH', 'NicoleKidman']
----------
tomCruise
name: Tom Cruise
location: Los Angeles, CA
website: http://www.tomcruise.com
bio:
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
following: ['katieH', 'NicoleKidman']
----------
If the query were changed to request short-format output, the entire interaction with your
program should be:
Data file: data.txt
Query file: query3.txt
['PerezHilton', 'tomCruise']
Required Functions
In the starter code file twitterverse_functions.py , complete the following function definitions. In
addition, you must add some helper functions to aid with the implementation of these required
functions. You will find more details on the Twitterverse dictionary, and query dictionary
format in the next section, and in the starter code file.
Title
Function name:
Parameter types) → Return
type
Full Description (paraphrase to get a proper docstring
description)
Untitled
process_data: (TextIO) ->
Dict[str, Dict[str, Any]]
The parameter represents a Twitter data file that is already open
for reading. Read the file and return the data in the Twitterverse
dictionary format. Note: in the docstring, do not provide example
calls for functions that read files.
Untitled 8
Title
Function name:
Parameter types) → Return
type
Full Description (paraphrase to get a proper docstring
description)
Untitled
process_query: (TextIO) ->
Dict[str, Dict[str, Any]]
The parameter represents a query file that is already open for
reading. Read the file and return the query in the query dictionary
format. Note: in the docstring, do not provide example calls for
functions that read files.
Untitled
all_followers: (Dict[str,
Dict[str, Any]], str) ->
List[str]:
The first parameter represents the data in the Twitterverse
dictionary, and the second parameter represents a username.
Identify all the usernames that are following the user specified by
the second parameter and return them as a list. This is a helper
function for the sorting code below. You are also free to call it in
the code that you write too.
Untitled
get_search_results: (Dict[str,
Dict[str, Any]], Dict[str,
Any]) -> List[str]:
The first parameter represents the data in the Twitterverse
dictionary format, and the second parameter represents the
search specification in the search specification dictionary format.
Perform the specified search on the given Twitter data, and return
a list of strings representing usernames that match the search
criteria. The parameters should not be modified by the function.
Untitled
get_filter_results: (Dict[str,
Dict[str, Any]], List[str],
Dict[str, str]) -> List[str]
The first parameter represents the data in the Twitterverse
dictionary format, the second parameter represents a list of
usernames, and the third parameter represents the filter
specification in the filter specification dictionary format. Apply the
specified filters to the given username list to determine which
usernames to keep, and return the resulting list of usernames. The
parameters should not be modified by the function.
Untitled
get_present_string: (Dict[str,
Dict[str, Any]], List[str],
Dict[str, str]) -> str:
The first parameter represents the data in the Twitterverse
dictionary format, the second parameter represents a list of
usernames, and the third parameter represents the presentation
specification in the presentation specification dictionary format.
Format the results for presentation based on the given
presentation specification and return the formatted string. The
parameters should not be modified by the function.
Data Structures
To increase the readability of our code, we will use dictionaries to represent both the Twitter
data and the query in our program. In this section, we'll look at the format of the data and
query dictionaries.
The Twitterverse Data Dictionary
The type of the Twitterverse dictionary is Dict[str, Dict[str, object]] . More specifically, each
key in the Twitterverse dictionary is the username of a Twitter user, and the value associated
with that key is a dictionary containing additional information about the user. Each of these
inner dictionaries will contain the keys 'name' , 'location' , 'web' , 'bio' , and 'following' . The
value associated with the 'following' key is a list of zero or more strings, and the values
associated with the rest of the keys are strings (and may be the empty string).
For example, if the following user information is included in our Twitter data file...
Untitled 9
tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
then the following key/value pair will be in our Twitterverse dictionary:
'tomCruise': {'name': 'Tom Cruise',
'bio': 'Official TomCruise.com crew tweets. We love you guys!\nVisit us at Facebook!',
'location': 'Los Angeles, CA',
'web': 'http://www.tomcruise.com',
'following': ['katieH', 'NicoleKidman']}
And another example: if the following user information is included in our Twitter data file (note
the blank lines for name, location, and web, and zero lines for bio and following):
quietTweeter
ENDBIO
END
then this key/value pair will be in our Twitterverse dictionary.
'quietTweeter': {'name': '',
'bio': '',
'location': '',
'web': '',
'following': []}
Note that dictionaries are not ordered, so don't worry if your dictionary output is in a different
order, as long as the items are the same.
The Query Dictionary
We will also use dictionaries to represent queries in our program. The query dictionary
contains three items; the key 'search' , whose value is a dictionary representing the search
specification, the key 'filter' , whose value is a dictionary representing the filter
specification, and the key 'present' , whose value is a dictionary representing the presentation
specification.
The search specification dictionary contains the keys 'username' and 'operations' , whose
values are the string representing the username to start at, and a list of strings representing
the operations to perform, respectively. Note that the list that is the value associated with the
key 'operations' may be empty, if there is only one line between
Untitled 10
the SEARCH and FILTER keywords in the query file. If there is only one line between SEARCH
and FILTER, then it represents the username start at, and there are no operations to be
performed in the search step. The operations must be performed in the order that they are
listed in the file. Note the change in results if the two operations are reversed in the Full
Example using query3.txt .)
The filter specification is a dictionary with strings representing filter keywords as keys, and
the strings to match as values. Note that this dictionary may be empty if there are no filters to
apply, and that each filter keyword will appear at most once in the filter section of a query file.
The presentation specification is a dictionary with keys 'sort-by' and 'format' , and string
values representing the order to sort the results, and the format to display them in,
respectively.
For example, if our query file looks like this:
SEARCH
tomCruise
following
following
following
FILTER
following KatieH
location-includes USA
PRESENT
sort-by popularity
format long
then the query dictionary will be this (again, the order of items may vary):
{'search': {'username': 'tomCruise', 'operations': ['following', 'following', 'following']},
'filter': {'following': 'KatieH', 'location-includes': 'USA'},
'present': {'sort-by': 'popularity', 'format': 'long'}}
Note that there are other ways that we could have chosen to represent the data and the query
in our program. However using dictionaries instead of, for example, lists, means that our code
contains string literal keys that are descriptive of what is being represented. For example,
using the format above, we will refer to the search specification in the query dictionary
using query['search'] , rather than query[0] , as we might if we were using a list or tuple of
query information.
The Main Program
Once you have correctly implemented the functions in twitterverse_functions.py , execution of
the main program ( twitterverse_program.py ) will:
Read a data file and produce the Twitterverse data dictionary.
Read a query file and produce the query dictionary.
Compute the results of the search operations.
Apply the filters to the search results.
Untitled 11
Create the presentation string.
Required Testing ( unittest )
Write (and submit) a set of unittests for the get_filter_results function. Name this
file test_get_filter_results.py . For each test method, include a brief docstring description
specifying what is being tested. For unittest methods, we do not include a type contract and
the docstring description should not include example calls.
Files to Download
All of the files included in the download for the assignment are listed in this section. These
files must all be placed in the same folder. Download the files by right-clicking on the link and
using your web-browser's "Save as" option. Do not use "cut and paste" as sometimes this can
cause undesirable formatting characters to be added to our Python code.
Please download the Assignment 3 Files and extract the zip archive. The following
paragraphs explain the files you have been given.
Python Code
The starter code for this assignment is in the file twitterverse_functions.py . You need to
complete this file.
The main program in the file twitterverse_program.py . It is complete and must not be
changed.
The starter code for the required unittests is in the file test_get_filter_results.py . You
need to complete the file.
Sample Twitter Data and Sample Queries
We have provided a number of text files containing Twitter data, and queries on that
data. All files in the provided zip file that contain data in their filename are Twitter
data files, and all files in the provided zip file that contain query in their filename are
query files.
The query files query1.txt , query2.txt , and query3.txt are queries on the data
in data.txt , and the file query4.txt is a query on the data in rdata.txt .
MarkUs Tester
We are providing an automatic student tester via MarkUs that tests two things:
whether your code follows the Python style guidelines (note: this is an additional
resource!, and
whether your functions: have the correct parameter(s), are named correctly, and have the
correct return types.
The MarkUs Tester does not test the correctness of your functions, so you must do that
yourself.
Additional Requirements
Untitled 12
Do not add statements that call print , input , or open , or use an import statement.
Do not use any break or continue statements. We are imposing this restriction (and we
have not even taught you these statements) because, for a beginner programmer, they are
very easy to abuse, resulting in terrible code.
Do not modify or add to the import statements provided in the starter code.
Error checking
You may assume that the files actually contain a valid Twitter data file and a valid query file,
respectively. You do not need to check either file to make sure that it is a valid Twitter or
query file.
You may be wondering how you can be sure that the content of the Twitter data file is valid.
For example, there are probably rules about legal Twitter usernames and there are certainly
rules about valid URLs. Don't confirm that the username, URL, or any other elements of the
Twitter data file are valid. Simply read whatever string is in the relevant position in the file and
use it as is.
Reading the Twitter data file
Typically, the easiest way to read through a data file is with a for-loop, but because of the
structure of a Twitter data file, you will need to use while loops.
Watch out for how you handle \n (newline) characters. If you use readline to get a line of the
file, it will have a \n on the end (except possibly on the last line in the file). You need to be
aware of that before you try to compare it to keywords like END . On the other hand, there are
times when you really do need a \n . For instance, when you store the contents of a user's bio,
it must include any newline characters that appear in the middle of the bio, so that you are
able to display it in the same way when using the long format presentation.
Sorting your output
Part of processing the presentation specification requires you to sort the list of users
according to username, name, or popularity. Since you will be sorting a list (of string
usernames), you can use list.sort to accomplish the task. However, by default list.sort uses
simple < to compare pairs of list elements. We know that < on strings compares them
alphabetically. So list.sort will simply sort the strings into alphabetical order. In some cases,
that is not the ordering you want.
Fortunately, there are ways to get around this. In the starter code for twitterverse_functions.py ,
we have provided our own implementation of a sorting function that has a parameter that you
can use to control the ordering. All you have to do is send tweet_sort a function that takes
two parameters of the same type as the elements in your list (in this case, the two parameters
would be username strings) and returns:
1 if the first one should appear before the second one in a sorted list
1 if the first one should appear after the second one in a sorted list, and
0 if they are "tied", so that it doesn't matter which one comes first.
We have also provided some functions that return 1, 1, or 0 for each possible sorting case.
Untitled 13
The simple module sort_demo.py demonstrates how this can be done. It is a good example of
passing a function to a function. We strongly recommend you spend some time understanding
this code before you work on the sorting part of the presentation specification.
Other ways to work with functions
There will be other places where you can eliminate a great deal of duplication in your code by
passing a function as an argument to a function.
Functions can be used similarly to variables. In particular, we can use them as values in
dictionaries. You may find the example in the module dict_of_funcs_demo.py useful in
understanding how to call different functions in different cases, without having to write
lengthy if/elif/else blocks.
Purpose
The purpose of this assignment is to give you practise working with files, building and using
dictionaries, designing functions, and writing unit tests.
Introduction: The Twitterverse
Twitter is a social networking website where users can post very short messages know as
"tweets". Each Twitter user can choose to "follow" other users, which means that they see
those users' tweets. A Twitter user sees the tweets of users they are "following", and their
tweets are seen by their "followers" (the users who follow them).
All the "follow" connections define a network among Twitter users, and it's quite interesting to
look for patterns in the connections. Tools like Twiangulate let you explore questions like
"what connections do my two friends have in common?". In this assignment, you'll write a
program that lets you ask questions (or "queries") about a Twitter dataset.
Any tool for exploring the Twitterverse must get its data from Twitter itself. Twitter
provides an API to allow programmers to write programs that interact with Twitter and extract
data from it. In general, an API is a module that defines functions for accessing underlying
data and performing other tasks without having to know how that data is actually stored and
retrieved.
To make this assignment more manageable for you, we will assume that the information we
need has already been extracted from Twitter and stored in a file.
How To Tackle This Assignment
This is your first experience designing a program of this size. We are providing detailed
advice to help you break the task down into manageable pieces.
Make sure your twitterverse_functions.py module runs without error before submitting. If you
have syntax errors in your module, comment them out before submitting, or we will not be
able to test your functions! Submitted programs that have syntax errors, do not
execute/run, or infinitely loop are subject to a grade of 0.
The Twitter Data File
A Twitter data file contains a series of one or more user profiles, one after the other. Each user
profile has the following elements, in this order:
A line containing a non-blank, non-empty username. You may assume that usernames are
unique, that is, a single username will not occur more than once in the file, and that
usernames do not contain any whitespace.
A line for the user's actual name. If they did not provide a name, this line will be blank.
A line for the user's location, or a blank line if they did not provide one.
A line for the URL of a website, or a blank line if they did not provide one.
Untitled 2
Zero or more lines for the user's bio, then a line with nothing but the keyword ENDBIO on it.
This marks the end of the bio, and is not considered part of it. You may assume that no
bio has the string ENDBIO within it.) If the user did not provide a bio, the ENDBIO line will
come immediately after the website line, with no blank line in between.
Zero or more lines each containing the username of someone that this user is following,
then a line with the keyword END on it. You may assume that no one has END as their
username.) A user cannot be on his or her own following list. You may assume that every
user on a following list has a user profile in the Twitter data file.
Notice that the keywords act as separators in this file. All of their letters are capitalized, and
the keywords contain no punctuation.
Examples
Here is a sample user profile that might occur among many in a file:
tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
The file data.txt is a smallish example of a complete Twitter data file (and was made by hand)
and the file rdata.txt is a much larger example (and is made from real data extracted from
Twitter). These should help you confirm your understanding of the file format and will also be
useful in testing your program.
Cycles in the data
Although a user cannot be on their own following or followers lists, there can be "loops" (we
call them "cycles") such as this: user A can be following B who is following A. This is the
shortest possible cycle. Of course, cycles can be longer.
The Query File
Note that the word "query" just means "question". In computer science, we use it to mean a
request for information. For this assignment, a query will be provided in a file. Below we will
review the high level parts of the query, look at an example, and then describe the format of
the query file.
Overview
A query has three components: a search specification, a filter specification, and a presentation
specification.
The search specification describes how to generate a list of Twitter usernames, starting with
an initial username (a list of length one) and then finding their followers or people they are
Untitled 3
following, then people that are those people's followers or who they are following, and so on.
When processing the search specification, don't try to do anything to avoid cycles. For
instance, if the search specification says to find the people who user A is following, and from
there the people they are following, you could find yourself back at user A. Don't try to avoid
that.
After processing the search specification, we have a list of Twitter usernames. Its length could
be zero. For example, if the initial username is 'adalovelace' and the search specification
contains a single 'followers' keyword, then the length of the list will be zero
if 'adalovelace' has no followers.
The filter specification describes how to filter the list of usernames produced by the search
specification. The filtering can be based on
whether or not they are following a particular user,
whether or not a particular user is their follower,
whether their name contains a particular string (case-insensitive), or
whether their location contains a particular string (case-insensitive).
After processing the filter specification, we have a possibly reduced list of usernames.
Once the search results have been found and filtered, the presentation
specification describes how the output should be presented. It specifies on what basis the
results should be sorted, and whether the results should be presented in a short or long
format.
Example query
Here is an example query:
SEARCH
tomCruise
following
following
following
FILTER
following KatieH
location-includes USA
PRESENT
sort-by popularity
format long
The search specification in this particular query has four steps.
Start with a list containing the username to start the search from; ie. ['tomCruise'] . Let's
call that list L1.
The search keyword 'following' says to replace each username p in L1 with the usernames
of the users who p is following. This yields a new list, L2.
For the next 'following' keyword, we start with L2 and repeat the same operation as in
the previous step, yielding another list, L3.
Untitled 4
For the final 'following' keyword, we start with L3 and repeat that operation one last time,
yielding list L4.
Notice that each step yields a list of zero or more usernames that is the input to the next step.
There should be no duplicates in the final results list. Duplicates should be removed after each
step.
The Twitter data file diagram_data.txt contains the follower/following relationships as
represented by this diagram. For those relationships, the search specification above would
yield this list of usernames: ['i', 'j', 'h', 'k', 'tomCruise'] . Make sure that you can see how
the four lists, ending with this final one, are generated. Notice that the final list contains the
users you can get to in three "steps" of the "following" relationship, starting from 'tomCruise' .
The final list generated by the search specification becomes the input to the filter
specification. For our current example, the filter specification says that the list should be
filtered in this way: a user should be kept only if they are following 'KatieH' and has a location
that includes the string 'USA' . The presentation specification says to present the results in
long format and to order the users according to their popularity.
Now let's look in detail at the exact format and meaning of a query.
Overall format of a query
The format of the query data file is as described below, in this order:
The keyword SEARCH alone on a line.
The search specification.
The keyword FILTER alone on a line.
The filter specification.
The keyword PRESENT alone on a line.
The presentation specification.
Notice that the keywords above all act as separators in this file and are in all capital letters.
There are other keywords (such as following which can be part of a search specification) that
are not separators; they are not capitalized.
Format and meaning of the search specification
The format of a search specification is as shown below, in this order:
A single username
Zero or more lines, in any order, each containing one of:
the keyword followers
the keyword following
Each step in the search specification involves modifying the current list of users by going
through each user in the current list and finding their followers or who they are following.
Importantly, at each step, we want to create the new results list by replacing each user from
the current list with the users that they are following or who are their followers.
Untitled 5
The possible search operations for a given user are:
followers operation: a list of all usernames for users who are following the given user (ie.
that have the given username in the value list paired with the key 'following' in their data
in the Twitterverse dictionary).
following operation: a list of all usernames for users who the given user is following (ie.
the value list paired with the key 'following' in the data associated with the given user in
the Twitterverse dictionary).
Note that, if the list of usernames is ['A'] , and we perform
a 'followers' or 'following' operation, 'A' will not be in the resulting list, since users cannot
follow themselves. If 'A' is one of multiple usernames in the current list, then 'A' can appear
in the results list, but only if one of the other usernames produces 'A' from the search
operation.
The final list of results should not contain any duplicates. It is most efficient to remove
duplicates after each operation is performed.
Format and meaning of the filter specification
The format of a filter specification is:
Zero or more lines, in any order, each containing one of the following:
the keyword name-includes , a space, and a string to match
the keyword location-includes , a space, and a string to match
the keyword follower , a space, and a string which is a valid username
the keyword following , a space, and a string which is a valid username
Each of the lines has exactly one blank within it (the space separating the keyword from the
next string), and each keyword will appear at most once.
The filters are "additive" in that each filter step should be applied to the list that resulted from
the previous filter step. That is, you start with the list from the search specification, filter it
based on the first filter step to get a new list L1, filter L1 in the second filter step to get L2, and
so on.
The 'following' and 'follower' filters will keep users in the list if they are following a
particular user, or if a particular user is their follower. These filters are defined as:
The 'following' filter keeps only users who have the provided username in
their 'following' list.
The 'follower' filter keeps only users who appear in the 'following' list for the provided
username.
The filters that have 'includes' as part of the keyword will do a simple substring search on
the string representing the relevant data for the user in the Twitterverse dictionary. This means
that if the given string occurs anywhere in a user's name (for a 'name-includes filter' ) or a
user's location (for a 'location-includes' filter), then that user is to be kept in the list. The
substring search should be case-insensitive. For example, if the filter specifies users whose
locations include the string "USA", then users with location "USA", "Modesto, California, USA",
Untitled 6
"kansas, usa" or "Musala, Bulgaria" would be kept, and users with location "United States of
America" would be excluded. This is far from perfect, but don't try to improve on it.
Format and meaning of the presentation specification
The output from your program must include every user whose username is still in the list after
the search and filter specifications have been processed.
The presentation specification describes how the output should be presented. It consists of
these two lines, in order:
The keyword 'sort-by' , then a space, then one of these
keywords: 'username' , 'name' , 'popularity'
Either 'format short' or 'format long' .
The users must be listed in the order indicated by the 'sort-by' portion of the presentation
specification. Output that is sorted by username or name should be in alphabetical order, so
that strings starting with 'a' are at the beginning of the output. Although usernames are
unique, names are not. If any users have the same name, sort them by username. When sorting
users by popularity, the most popular users should come first. A user's popularity is defined to
be the number of followers they have. Not the number of people they are following! If any
users are tied for popularity, sort them by username.
If the presentation specification is for short format output, the output should consist of the
usernames as a Python list of strings. If the presentation specification is for long format
output, the output for each user should consist of a series of lines about the user, as shown in
the example below. The long format output must include a line of ten dashes between each
pair of users, as well as a line of dashes at the very top and very bottom of the output.
If there are no users to be presented (i.e., the 2nd parameter in get_present_string represents
an empty list), then the long format output should just include two lines of dashes.
For the long format, the series of lines describing a single user should be formatted as shown
below:
tomCruise
name: Tom Cruise
location: Los Angeles, CA
website: http://www.tomcruise.com
bio:
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
following: ['katieH', 'NicoleKidman']
Notice that:
The name, location, and website text is reproduced exactly as it was in the Twitter data
file, with the addition of the labels at the start of the line.
The bio begins on the line after the line containing 'bio:' , unlike the other components
that appear on the same line as their "tag".
The bio itself is also reproduced exactly as it was in the Twitter data file, including upper
and lower case, newline characters, etc.
Untitled 7
The "following" list is produced simply by converting the list of username strings to a
string.
Full Example
If the user were to enter the name of the Twitter data file data.txt and the query query3.txt,
the entire interaction with the Twitterverse program should be:
Data file: data.txt
Query file: query3.txt
----------
PerezHilton
name: Perez Hilton
location: Hollywood, California
website: http://www.PerezH...
bio:
Perez Hilton is the creator and writer of one of the most famous websites
in the world. And he also loves music - a lot!
following: ['tomCruise', 'katieH', 'NicoleKidman']
----------
tomCruise
name: Tom Cruise
location: Los Angeles, CA
website: http://www.tomcruise.com
bio:
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
following: ['katieH', 'NicoleKidman']
----------
If the query were changed to request short-format output, the entire interaction with your
program should be:
Data file: data.txt
Query file: query3.txt
['PerezHilton', 'tomCruise']
Required Functions
In the starter code file twitterverse_functions.py , complete the following function definitions. In
addition, you must add some helper functions to aid with the implementation of these required
functions. You will find more details on the Twitterverse dictionary, and query dictionary
format in the next section, and in the starter code file.
Title
Function name:
Parameter types) → Return
type
Full Description (paraphrase to get a proper docstring
description)
Untitled
process_data: (TextIO) ->
Dict[str, Dict[str, Any]]
The parameter represents a Twitter data file that is already open
for reading. Read the file and return the data in the Twitterverse
dictionary format. Note: in the docstring, do not provide example
calls for functions that read files.
Untitled 8
Title
Function name:
Parameter types) → Return
type
Full Description (paraphrase to get a proper docstring
description)
Untitled
process_query: (TextIO) ->
Dict[str, Dict[str, Any]]
The parameter represents a query file that is already open for
reading. Read the file and return the query in the query dictionary
format. Note: in the docstring, do not provide example calls for
functions that read files.
Untitled
all_followers: (Dict[str,
Dict[str, Any]], str) ->
List[str]:
The first parameter represents the data in the Twitterverse
dictionary, and the second parameter represents a username.
Identify all the usernames that are following the user specified by
the second parameter and return them as a list. This is a helper
function for the sorting code below. You are also free to call it in
the code that you write too.
Untitled
get_search_results: (Dict[str,
Dict[str, Any]], Dict[str,
Any]) -> List[str]:
The first parameter represents the data in the Twitterverse
dictionary format, and the second parameter represents the
search specification in the search specification dictionary format.
Perform the specified search on the given Twitter data, and return
a list of strings representing usernames that match the search
criteria. The parameters should not be modified by the function.
Untitled
get_filter_results: (Dict[str,
Dict[str, Any]], List[str],
Dict[str, str]) -> List[str]
The first parameter represents the data in the Twitterverse
dictionary format, the second parameter represents a list of
usernames, and the third parameter represents the filter
specification in the filter specification dictionary format. Apply the
specified filters to the given username list to determine which
usernames to keep, and return the resulting list of usernames. The
parameters should not be modified by the function.
Untitled
get_present_string: (Dict[str,
Dict[str, Any]], List[str],
Dict[str, str]) -> str:
The first parameter represents the data in the Twitterverse
dictionary format, the second parameter represents a list of
usernames, and the third parameter represents the presentation
specification in the presentation specification dictionary format.
Format the results for presentation based on the given
presentation specification and return the formatted string. The
parameters should not be modified by the function.
Data Structures
To increase the readability of our code, we will use dictionaries to represent both the Twitter
data and the query in our program. In this section, we'll look at the format of the data and
query dictionaries.
The Twitterverse Data Dictionary
The type of the Twitterverse dictionary is Dict[str, Dict[str, object]] . More specifically, each
key in the Twitterverse dictionary is the username of a Twitter user, and the value associated
with that key is a dictionary containing additional information about the user. Each of these
inner dictionaries will contain the keys 'name' , 'location' , 'web' , 'bio' , and 'following' . The
value associated with the 'following' key is a list of zero or more strings, and the values
associated with the rest of the keys are strings (and may be the empty string).
For example, if the following user information is included in our Twitter data file...
Untitled 9
tomCruise
Tom Cruise
Los Angeles, CA
http://www.tomcruise.com
Official TomCruise.com crew tweets. We love you guys!
Visit us at Facebook!
ENDBIO
katieH
NicoleKidman
END
then the following key/value pair will be in our Twitterverse dictionary:
'tomCruise': {'name': 'Tom Cruise',
'bio': 'Official TomCruise.com crew tweets. We love you guys!\nVisit us at Facebook!',
'location': 'Los Angeles, CA',
'web': 'http://www.tomcruise.com',
'following': ['katieH', 'NicoleKidman']}
And another example: if the following user information is included in our Twitter data file (note
the blank lines for name, location, and web, and zero lines for bio and following):
quietTweeter
ENDBIO
END
then this key/value pair will be in our Twitterverse dictionary.
'quietTweeter': {'name': '',
'bio': '',
'location': '',
'web': '',
'following': []}
Note that dictionaries are not ordered, so don't worry if your dictionary output is in a different
order, as long as the items are the same.
The Query Dictionary
We will also use dictionaries to represent queries in our program. The query dictionary
contains three items; the key 'search' , whose value is a dictionary representing the search
specification, the key 'filter' , whose value is a dictionary representing the filter
specification, and the key 'present' , whose value is a dictionary representing the presentation
specification.
The search specification dictionary contains the keys 'username' and 'operations' , whose
values are the string representing the username to start at, and a list of strings representing
the operations to perform, respectively. Note that the list that is the value associated with the
key 'operations' may be empty, if there is only one line between
Untitled 10
the SEARCH and FILTER keywords in the query file. If there is only one line between SEARCH
and FILTER, then it represents the username start at, and there are no operations to be
performed in the search step. The operations must be performed in the order that they are
listed in the file. Note the change in results if the two operations are reversed in the Full
Example using query3.txt .)
The filter specification is a dictionary with strings representing filter keywords as keys, and
the strings to match as values. Note that this dictionary may be empty if there are no filters to
apply, and that each filter keyword will appear at most once in the filter section of a query file.
The presentation specification is a dictionary with keys 'sort-by' and 'format' , and string
values representing the order to sort the results, and the format to display them in,
respectively.
For example, if our query file looks like this:
SEARCH
tomCruise
following
following
following
FILTER
following KatieH
location-includes USA
PRESENT
sort-by popularity
format long
then the query dictionary will be this (again, the order of items may vary):
{'search': {'username': 'tomCruise', 'operations': ['following', 'following', 'following']},
'filter': {'following': 'KatieH', 'location-includes': 'USA'},
'present': {'sort-by': 'popularity', 'format': 'long'}}
Note that there are other ways that we could have chosen to represent the data and the query
in our program. However using dictionaries instead of, for example, lists, means that our code
contains string literal keys that are descriptive of what is being represented. For example,
using the format above, we will refer to the search specification in the query dictionary
using query['search'] , rather than query[0] , as we might if we were using a list or tuple of
query information.
The Main Program
Once you have correctly implemented the functions in twitterverse_functions.py , execution of
the main program ( twitterverse_program.py ) will:
Read a data file and produce the Twitterverse data dictionary.
Read a query file and produce the query dictionary.
Compute the results of the search operations.
Apply the filters to the search results.
Untitled 11
Create the presentation string.
Required Testing ( unittest )
Write (and submit) a set of unittests for the get_filter_results function. Name this
file test_get_filter_results.py . For each test method, include a brief docstring description
specifying what is being tested. For unittest methods, we do not include a type contract and
the docstring description should not include example calls.
Files to Download
All of the files included in the download for the assignment are listed in this section. These
files must all be placed in the same folder. Download the files by right-clicking on the link and
using your web-browser's "Save as" option. Do not use "cut and paste" as sometimes this can
cause undesirable formatting characters to be added to our Python code.
Please download the Assignment 3 Files and extract the zip archive. The following
paragraphs explain the files you have been given.
Python Code
The starter code for this assignment is in the file twitterverse_functions.py . You need to
complete this file.
The main program in the file twitterverse_program.py . It is complete and must not be
changed.
The starter code for the required unittests is in the file test_get_filter_results.py . You
need to complete the file.
Sample Twitter Data and Sample Queries
We have provided a number of text files containing Twitter data, and queries on that
data. All files in the provided zip file that contain data in their filename are Twitter
data files, and all files in the provided zip file that contain query in their filename are
query files.
The query files query1.txt , query2.txt , and query3.txt are queries on the data
in data.txt , and the file query4.txt is a query on the data in rdata.txt .
MarkUs Tester
We are providing an automatic student tester via MarkUs that tests two things:
whether your code follows the Python style guidelines (note: this is an additional
resource!, and
whether your functions: have the correct parameter(s), are named correctly, and have the
correct return types.
The MarkUs Tester does not test the correctness of your functions, so you must do that
yourself.
Additional Requirements
Untitled 12
Do not add statements that call print , input , or open , or use an import statement.
Do not use any break or continue statements. We are imposing this restriction (and we
have not even taught you these statements) because, for a beginner programmer, they are
very easy to abuse, resulting in terrible code.
Do not modify or add to the import statements provided in the starter code.
Error checking
You may assume that the files actually contain a valid Twitter data file and a valid query file,
respectively. You do not need to check either file to make sure that it is a valid Twitter or
query file.
You may be wondering how you can be sure that the content of the Twitter data file is valid.
For example, there are probably rules about legal Twitter usernames and there are certainly
rules about valid URLs. Don't confirm that the username, URL, or any other elements of the
Twitter data file are valid. Simply read whatever string is in the relevant position in the file and
use it as is.
Reading the Twitter data file
Typically, the easiest way to read through a data file is with a for-loop, but because of the
structure of a Twitter data file, you will need to use while loops.
Watch out for how you handle \n (newline) characters. If you use readline to get a line of the
file, it will have a \n on the end (except possibly on the last line in the file). You need to be
aware of that before you try to compare it to keywords like END . On the other hand, there are
times when you really do need a \n . For instance, when you store the contents of a user's bio,
it must include any newline characters that appear in the middle of the bio, so that you are
able to display it in the same way when using the long format presentation.
Sorting your output
Part of processing the presentation specification requires you to sort the list of users
according to username, name, or popularity. Since you will be sorting a list (of string
usernames), you can use list.sort to accomplish the task. However, by default list.sort uses
simple < to compare pairs of list elements. We know that < on strings compares them
alphabetically. So list.sort will simply sort the strings into alphabetical order. In some cases,
that is not the ordering you want.
Fortunately, there are ways to get around this. In the starter code for twitterverse_functions.py ,
we have provided our own implementation of a sorting function that has a parameter that you
can use to control the ordering. All you have to do is send tweet_sort a function that takes
two parameters of the same type as the elements in your list (in this case, the two parameters
would be username strings) and returns:
1 if the first one should appear before the second one in a sorted list
1 if the first one should appear after the second one in a sorted list, and
0 if they are "tied", so that it doesn't matter which one comes first.
We have also provided some functions that return 1, 1, or 0 for each possible sorting case.
Untitled 13
The simple module sort_demo.py demonstrates how this can be done. It is a good example of
passing a function to a function. We strongly recommend you spend some time understanding
this code before you work on the sorting part of the presentation specification.
Other ways to work with functions
There will be other places where you can eliminate a great deal of duplication in your code by
passing a function as an argument to a function.
Functions can be used similarly to variables. In particular, we can use them as values in
dictionaries. You may find the example in the module dict_of_funcs_demo.py useful in
understanding how to call different functions in different cases, without having to write
lengthy if/elif/else blocks.