我如何编写脚本来汇总来自多个Fasta文件的信息,而不使用biopython?

问题描述 投票:0回答:1

我有一个生物信息学课程的作业,该课程要求python脚本对其中包含多个蛋白质序列的FASTA文件执行以下操作:

-打开由用户输入指定的.fasta文件

-打印标题行(即以“>”开头的行)

-打印前10个氨基酸并在同一行上,报告序列中的氨基酸数目

经过几个小时的尝试,我仅设法获得了标题行,并且仅为第一个序列打印了前10个氨基酸。我写的for循环似乎无法正常运行(如果这很垃圾,我很抱歉,我是一个完整的初学者!)

input_file = open(input("\nPlease enter the name of a FASTA file (inlcuding the .fasta extension): "))
# Opens the file specified by the user
for line in input_file:
    line = line.rstrip()
    if line[0] == ">":
        header = line
        print("\nHeader:", header)  # prints the header line
        no_lines_searched = 0  # To stop loop after first line of sequence when printing first 10 AAs
        for i in input_file:
            if ">" not in i and no_lines_searched < 1:
                print("First ten amino acid residues: ", i[0:10], "# Amino acids: ")  # Prints the first 10 AAs
                no_lines_searched = no_lines_searched+1  # To stop loop after first line of sequence

[我试图变得聪明,设计第二个环,以使其返回序列的前10个氨基酸,然后停止直到遇到另一个序列(用“>”表示)。

我当时打算使用占位符%s以某种方式计算文件中每个序列的总序列长度,但似乎无法超越这一点!

我得到的输出如下:


Header: >sp|P03378|ENV_HV1A2 Envelope glycoprotein gp160 OS=Human immunodeficiency virus type 1 group M subtype B (isolate ARV2/SF2) GN=env PE=3 SV=1
First ten amino acid residues:  MKVKGTRRNY # Amino acids:

任何帮助将不胜感激!

python fasta
1个回答
0
投票
[今天大部分时间都在解决这个问题之后,我回答了自己的问题。我正在以完全错误的方式进行操作。这是我用来准确实现目标的代码:

#!/usr/bin/env python3 seq = "" # seq is a variable used to store each line of sequence data as the loop below iterates. ten_AAs = '' # ten_AAs is a variable used to store the first ten amino acids of each sequence (it actually stores # the first 10 AAs from each line of the sequence, but only the first 10 residues of each sequence are printed). seq_count = 0 # seq_count is a variable used to track the total number of sequences in the file. infile = input("\nPlease enter the name of a FASTA file (including the .fasta extension): ") # Stores the file specified by the user in the variable, infile. with open(infile, "r") as fasta: # Opens the above file such that we do not need to call filename.close() at the end. print("\n") for line in fasta: # Initiates a for loop to summarise the file: headers, first ten residues and sequence lengths. line = line.rstrip() # Removes trailing newline characters. if line[0] == ">": # Indicates the start of a new FASTA sequence and is a condition for printing summary data if seq != "": # Second condition for printing summary data (see lines 28-30). print("Header: {0}\nFirst ten amino acids: {1} \tSequence length: {2} residues\n" .format(header, ten_AAs[0:10], len(seq))) # Placeholders for summary info. seq_count += 1 # Increases seq_count by one for each header encountered. header = line # Stores the header line for printing. seq = "" # Resets seq for each iteration. ten_AAs = '' # Resets ten_AAs for each iteration. else: seq += line[:] # Appends seq with the characters of the line if it is not a header (i.e. sequence data). ten_AAs += line[0:10] # Appends ten_AAs with the first ten characters of the line if it is not a header. # After reading all lines of one sequence, the loop encounters the next FASTA sequence, starting with a ">" character, # but 'seq' is not blank. Therefore, both 'if' statements are now TRUE and all summary data for the first sequence # (stored in 'header', 'ten_AAs' and 'seq') are printed as per lines 18-19. print("Header: {0}\nFirst ten amino acids: {1} \tSequence length: {2} residues" .format(header, ten_AAs[0:10], len(seq))) # The final sequence is not followed by another accession number; therefore, the summary data for the final sequence # must be printed manually, outside of the loop. print("\nThis multi-FASTA file contains", str(seq_count), "sequences.") # For completeness' sake, reports the number of sequences in the multi-FASTA file.

这是输出:

Please enter the name of a FASTA file (including the .fasta extension): multiprotein.fasta Header: >1433G_HUMAN (P61981) 14-3-3 protein gamma (Protein kinase C inhibitor protein 1) (KCIP-1) [Homo sapiens] First ten amino acids: VDREQLVQKA Sequence length: 246 residues Header: >ATP8_RAT (P11608) ATP synthase protein 8 (EC 3.6.3.14) (ATPase subunit 8) (A6L) (Chargerin II) [Rattus norvegicus] First ten amino acids: MPQLDTSTWF Sequence length: 67 residues Header: >ALR_LISIN (Q92DC9) Alanine racemase (EC 5.1.1.1) First ten amino acids: MVTGWHRPTW Sequence length: 368 residues Header: >CDCA4_HUMAN (Q9BXL8) Cell division cycle-associated protein 4 (Hematopoietic progenitor protein) [Homo sapiens] First ten amino acids: MFARGLKRKC Sequence length: 241 residues This multi-FASTA file contains 4 sequences.

© www.soinside.com 2019 - 2024. All rights reserved.