我试图将其变成一个循环,因为我有很多文件,我正在使用 Biopython,但我不确定是否可能。
good_reads = (
rec
for rec in SeqIO.parse(rec1, "fastq")
if max(rec.letter_annotations["phred_quality"]) >= 30
)
count = SeqIO.write(good_reads, "rec1.fastq", "fastq")
鉴于我有rec1、rec2等
我的尝试:
给定输入:
fastq_1.fq
:
@SEQ_ID_1_1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
fastq_2.fq
:
@SEQ_ID_1_3
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID_2_3
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fastq_3.fq
:
@SEQ_ID_1_2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID_2_2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
和代码:
from Bio import SeqIO
import Bio
print('\n-------------------------------')
print('\n Biopython Version : ', Bio.__version__)
print('\n-------------------------------')
fq_list = ['fastq_1.fq' , 'fastq_2.fq', 'fastq_3.fq' ]
good_reads = ([(rec.format("fastq") for rec in SeqIO.parse(seq, "fastq") if max(rec.letter_annotations["phred_quality"]) >= 30) for seq in fq_list ])
print('\n\ngood_reads : ', good_reads)
# this one empties the generators list
# for i in good_reads:
# print(*i)
with open("good_reads_2.fasta", "w+") as output_handle:
for recs in good_reads:
for rec in recs:
output_handle.write(rec)
输出
good_reads_2.fasta
:
@SEQ_ID_1_1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID_1_3
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID_1_2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID_2_2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
我不是测序专家,但也许这两个链接更能说明碱基质量得分:
让我们知道它们是否有意义