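I wrote a Python script to download protein sequences from UniProt in FASTA format. The script reads accession numbers from a text file (one per line) and then tries to download the corresponding sequences from the UniProt database. Here is the script: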
import requests
with open('testfasta.txt', 'r') as infile:
    lines = infile.readlines()
count = 0
for line in lines:
    count += 1
    line = line.strip()
    access_id = line
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1 + access_id + url_part2
    response = requests.get(URL)
    with open(access_id + ".fa", "wb") as txtFile:
        txtFile.write(response.content)
print("Total sequences downloaded = ", count)
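This works fine, but for hundreds of sequences it generates a lot of files. So it would be helpful to write everything into a single file instead: the first sequence, then the second one right below it, and so on. A FASTA file is basically a plain text file in which each record's header line starts with ">". For example: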
>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh
and so on.
Something like this?
import requests

# Open the ID list and the combined output file together.
with open('testfasta.txt', 'r') as infile, \
        open('results.fasta', 'w') as outfile:
    for count, line in enumerate(infile, 1):
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        # todo: check status code
        outfile.write(f'>{access_id}\n')
        # The file is opened in text mode, so write str (response.text), not bytes.
        outfile.write(response.text)
print(f"Total sequences downloaded = {count}")
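This assumes that the data you fetch ends with a newline and contains only the sequence itself. If the response already carries its own ">" header line, drop the outfile.write(f'>{access_id}\n') call so the header is not written twice. I also made various changes to make the code more idiomatic.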
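If you want to fill in the status-code todo and not depend on the assumptions above, here is a minimal sketch of one way it could look. The function names `fetch_uniprot_fasta` and `append_fasta_records` and the 30-second timeout are my own choices for illustration, not anything from UniProt or requests. It skips blank lines and failed downloads, adds a header only when the response does not already start with ">", and makes sure each record ends with a newline.

```python
from typing import Callable, Iterable, Optional, TextIO


def fetch_uniprot_fasta(access_id: str) -> Optional[str]:
    """Return the response body for one accession, or None on a non-2xx status."""
    # Third-party package; imported here so append_fasta_records can be
    # used (and tested) without requests installed.
    import requests

    response = requests.get(
        f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta',
        timeout=30)  # assumption: 30 s is an arbitrary illustrative timeout
    return response.text if response.ok else None


def append_fasta_records(accession_lines: Iterable[str],
                         fetch: Callable[[str], Optional[str]],
                         outfile: TextIO) -> int:
    """Write one FASTA record per accession to outfile; return how many were written."""
    written = 0
    for raw in accession_lines:
        access_id = raw.strip()
        if not access_id:
            continue  # skip blank lines in the ID file
        text = fetch(access_id)
        if not text:
            continue  # failed or empty download: nothing to write
        if not text.startswith('>'):
            # Only add a header when the response does not bring its own.
            outfile.write(f'>{access_id}\n')
        outfile.write(text if text.endswith('\n') else text + '\n')
        written += 1
    return written


# Example usage (performs real network requests):
# with open('testfasta.txt') as infile, open('results.fasta', 'w') as outfile:
#     count = append_fasta_records(infile, fetch_uniprot_fasta, outfile)
# print(f"Total sequences downloaded = {count}")
```

Passing the fetch function in as an argument is just a convenience: you can try the writing logic with a fake fetcher before hitting the real server a few hundred times.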