GNBF5010 Homework 2

Please zip all your files for Homework 2, including the scripts, input files and output files if any, into a

single file called YourLastname_Firstname_HW2.zip (or .rar). Then submit it to Blackboard

on or before Wednesday, 23 October 2019.

NOTE 1: Add the necessary comments to your program to explain your code. Examples of

commenting can be found in the textbook.

NOTE 2: Test your program with various test cases to ensure that it works properly.

1. Unknown Letters

Write a program to list which letters in the file seqs.txt are not A, T, C, or G. It should only list

each letter once. Hint: Start with an empty list for unknown letters. Then use two loops to scan

the letters in each sequence.
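The hint above can be sketched as follows. This is a minimal illustration, not a reference solution: the function name `find_unknown_letters` and the sample lines are made up, and the comparison is case-sensitive (a lowercase `a` would be reported as unknown), which is an assumption the assignment does not settle.

```python
def find_unknown_letters(sequences):
    """Return every letter that is not A, T, C or G, listed once each,
    in order of first appearance."""
    unknown = []                       # start with an empty list
    for seq in sequences:              # outer loop: one sequence per line
        for letter in seq.strip():     # inner loop: one letter at a time
            if letter not in "ATCG" and letter not in unknown:
                unknown.append(letter)
    return unknown


# Made-up lines standing in for the contents of seqs.txt
sample_lines = ["ATCGNNA\n", "GGRTCA\n", "ATNCG\n"]
print(find_unknown_letters(sample_lines))  # ['N', 'R']
```

With the real input, the same function could be given an open file object, since iterating over a file yields its lines.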

2. Sequence Properties

Write a program that 1) reads all sequences in seqs.txt and stores them in a list called seqs, 2)

prompts the user with a menu for selecting various properties of the sequences, and 3) shows the

corresponding results based on the user’s choice. The menu should include:

1) Number of sequences in the input file

2) Number of occurrences of a specific sequence, e.g. GGATC (The program will then prompt

the user for the target sequence.)

3) Number of sequences that are longer than a particular length, e.g. 1000 bases (The

program will ask the user again for the minimum length.)

4) Number of sequences with GC content higher than a given value, e.g. 50% (The GC

content could be calculated as (num_of_G + num_of_C) / seq_total_len )

5) The combination of choices 3 and 4: Number of sequences longer than a particular

length and with GC content over a particular value
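One possible shape for the analysis functions behind these options is sketched below. All function names are hypothetical. Two assumptions are worth flagging: the description says "longer than" and "higher than" while the sample menu uses >=, and this sketch follows the menu; and pattern occurrences are counted with `str.count`, which does not count overlapping matches.

```python
def count_sequences(seqs):
    """Option 1: number of sequences."""
    return len(seqs)


def count_pattern(seqs, pattern):
    """Option 2: total non-overlapping occurrences of pattern in all sequences."""
    return sum(seq.count(pattern) for seq in seqs)


def gc_content(seq):
    """GC content as a percentage; an empty sequence is treated as 0%."""
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def count_min_length(seqs, min_len):
    """Option 3: sequences with length >= min_len."""
    return sum(1 for seq in seqs if len(seq) >= min_len)


def count_min_gc(seqs, min_gc):
    """Option 4: sequences with GC% >= min_gc."""
    return sum(1 for seq in seqs if gc_content(seq) >= min_gc)


def count_min_length_and_gc(seqs, min_len, min_gc):
    """Option 5: sequences meeting both thresholds at once."""
    return sum(1 for seq in seqs
               if len(seq) >= min_len and gc_content(seq) >= min_gc)


# Made-up sequences for illustration
seqs = ["GGATCC", "ATAT", "GGCC", "GGATCGGATC"]
print(count_pattern(seqs, "GGATC"))          # 3
print(count_min_gc(seqs, 50))                # 3
print(count_min_length_and_gc(seqs, 6, 65))  # 1
```

Option 5 tests both conditions on the same sequence; multiplying or intersecting the separate counts from options 3 and 4 would not give the same answer.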

In your program, there should be separate functions for the analysis in options 1 to 4. Your

program should work like this:

Please select the sequences property that you want to display, or press 0 to

exit the program.

1) Total number of sequences

2) Number of pattern occurrences

3) Number of sequences with length >= min_len

4) Number of sequences with GC% >= min_GC

5) Number of sequences with length >= min_len and GC% >= min_GC

Enter the choice: 4

Enter the minimum GC content (min_GC): 50

Calculating …

There are 36 sequences with GC% >= 50%.

==

Please select the sequences property that you want to display, or press 0 to

exit the program.

1) Total number of sequences

2) Number of pattern occurrences

3) Number of sequences with length >= min_len

4) Number of sequences with GC% >= min_GC

5) Number of sequences with length >= min_len and GC% >= min_GC

Enter the choice: 5

Enter the minimum length (min_len): 1000

Enter the minimum GC content (min_GC): 40

Calculating …

There are 10 sequences with length >= 1000 bases and GC% >= 40%.

==

Please select the sequences property that you want to display, or press 0 to

exit the program.

1) Total number of sequences

2) Number of pattern occurrences

3) Number of sequences with length >= min_len

4) Number of sequences with GC% >= min_GC

5) Number of sequences with length >= min_len and GC% >= min_GC

Enter the choice: 0

Exiting the program …

3. Unique Words

Write a program that displays a list of all the unique words found in the file uniq_words.txt.

Print your results in alphabetical order and lowercase. Hint: Store words as the elements of a set;

remove punctuation by using string.punctuation from the string module.
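A minimal sketch of the hint, assuming words are separated by whitespace. The function name `unique_words` and the sample text are made up. Note that deleting punctuation joins the parts of a word such as "don't" into "dont"; whether that is acceptable depends on the input file.

```python
import string


def unique_words(text):
    """Return the unique words of text, lowercased and sorted alphabetically."""
    # Translation table that deletes every character in string.punctuation
    remove_punct = str.maketrans("", "", string.punctuation)
    words = text.translate(remove_punct).lower().split()
    return sorted(set(words))          # the set drops duplicates


# Made-up text standing in for the contents of uniq_words.txt
sample_text = "The cat saw the dog. The dog, startled, ran!"
print(unique_words(sample_text))
# ['cat', 'dog', 'ran', 'saw', 'startled', 'the']
```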

4. Molecular Weight

a) Make a Python dictionary mapping one-letter amino acid codes (the keys) to their molecular

weights (the values), for all 22 amino acids. The molecular weights of the 22 amino acids can be

found in the table on the next page. As an example, the molecular weight of C (Cysteine) is 121.

b) Print out a list of all the amino acids sorted by their molecular weights from the heaviest to

the lightest. Hint: You may need to sort the items of the dictionary in question (a) based on

the values; example output:

AA MW

W 204Da

Y 181Da

R 174Da

F 165Da

… …
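Sorting by value can be sketched with `sorted` and a key function. The dictionary below is deliberately partial: it holds only the five weights quoted in this assignment text, and the remaining entries would have to come from the assignment's table. The names `partial_weights` and `sort_by_weight` are hypothetical.

```python
# Partial dictionary: only the weights quoted in the assignment text.
partial_weights = {"C": 121, "W": 204, "Y": 181, "R": 174, "F": 165}


def sort_by_weight(weights):
    """Return (amino_acid, weight) pairs, heaviest first."""
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)


print("AA MW")
for amino_acid, weight in sort_by_weight(partial_weights):
    print(f"{amino_acid} {weight}Da")
```

`dict.items()` yields (key, value) pairs, so `item[1]` selects the weight as the sort key and `reverse=True` puts the heaviest first.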

c) Read the protein sequence from lysozyme.fasta and calculate the molecular weight of

this protein using the dictionary created in question (a).
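One way to approach part (c) is sketched below, assuming lysozyme.fasta holds a single record whose header line starts with ">". The function names, the toy record and the partial dictionary are all made up for illustration. The sketch simply sums the per-residue weights; the assignment does not say whether the water released at each peptide bond should be subtracted, and this version does not do so.

```python
def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA input,
    skipping header lines that start with '>'."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def protein_weight(sequence, weights):
    """Sum the weight of every residue; raises KeyError for an unknown code."""
    return sum(weights[residue] for residue in sequence)


# Partial dictionary (only weights quoted in the assignment) and a toy record
partial_weights = {"C": 121, "W": 204, "Y": 181, "R": 174, "F": 165}
toy_fasta = [">toy record\n", "CWY\n", "RF\n"]

sequence = read_fasta_sequence(toy_fasta)
print(sequence)                                   # CWYRF
print(protein_weight(sequence, partial_weights))  # 845
```

With the real file, `read_fasta_sequence` could be passed the open file object directly, because it only needs an iterable of lines.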

5. Palindromic sequence

A palindromic sequence is a nucleic acid sequence in a double-stranded DNA or RNA molecule

wherein reading in a certain direction (e.g. 5' to 3') on one strand matches the sequence reading

in the same direction (e.g. 5' to 3') on the complementary strand. Here is an example:

5'-GAATTC-3'

3'-CTTAAG-5'

On both strands, reading from 5' to 3' gives the same sequence: GAATTC. The DNA

sequence GAATTC is thus said to be palindromic. For more details about the function of

palindromic sequences, see here. Now, write a program that reads DNA sequences from the file

palin_seq.txt and uses recursion to determine whether each of them is a palindromic

sequence. Print the results of your program in the following format.

1) ATCGAT --- YES

2) GAATTC --- YES

3) ATCGGCTA --- NO

Hint: Use string slicing to refer to and compare the characters on either end of the sequence string.
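The recursive idea in the hint can be sketched as follows: a sequence is palindromic when its first base is the complement of its last base and the slice between them is itself palindromic. The names `COMPLEMENT` and `is_palindromic` are hypothetical, and treating any letter other than A, T, C or G as a mismatch is an assumption.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(seq):
    """Return True if seq equals its own reverse complement (recursive)."""
    if len(seq) == 0:
        return True                    # base case: every pair matched
    if len(seq) == 1:
        return False                   # no base is its own complement
    # Compare the two ends; unknown letters map to None and fail here
    if COMPLEMENT.get(seq[0]) != seq[-1]:
        return False
    return is_palindromic(seq[1:-1])   # recurse on the inner slice


# The three sequences from the sample output
sequences = ["ATCGAT", "GAATTC", "ATCGGCTA"]
for number, seq in enumerate(sequences, start=1):
    verdict = "YES" if is_palindromic(seq) else "NO"
    print(f"{number}) {seq} --- {verdict}")
```

ATCGGCTA reads the same forwards and backwards as plain text, yet it is reported as NO because its first and last bases are both A rather than a complementary pair. Each call strips two bases, so very long sequences could reach Python's default recursion limit.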
