我正在设计一个代码,需要在早期阶段输入 .fasta 文件。现在,我正在使用此函数验证输入:
def file_validation(fasta):
while True:
try:
file_name= str(raw_input(fasta))
except IOError:
print("Please give the name of the fasta file that exists in the folder!")
continue
if not(file_name.endswith(".fasta")):
print("Please give the name of the file with the .fasta extension!")
else:
break
return file_name
现在,虽然此函数工作正常,但仍然存在一些错误空间,因为用户可能会输入一个文件,该文件虽然文件名以 .fasta 结尾,但内部可能包含一些非 .fasta 内容。我可以做什么来防止这种情况并让用户知道他/她的 .fasta 文件已损坏?
为什么不像 FASTA 一样解析文件并查看它是否损坏?
biopython
,它会因在非 FASTA 文件上返回空生成器而失败:
from Bio import SeqIO
my_file = "example.csv" # Obviously not FASTA
def is_fasta(filename):
with open(filename, "r") as handle:
fasta = SeqIO.parse(handle, "fasta")
return any(fasta) # False when `fasta` is empty, i.e. wasn't a FASTA file
is_fasta(my_file)
# False
Biopython SeqIO 即使使用空文件作为输入也会生成
Bio.SeqIO.FastaIO.FastaIterator
。
可能不是最优雅的解决方案。遍历每一行并检查“>”字符:
# Validate fasta file
fastafile = open("sequences.fasta", "r")
for line in fastafile:
texit = False
if not line.startswith('>'):
texit = True
try:
line = next(fastafile)
except:
texit = True
if line.startswith('>'):
texit = True
if texit:
print("The file provided does not appear to be a proper FASTA file!")
exit()
fastafile.close()