As I understand it, each record in a FASTQ file consists of 4 lines:
Line starting with @ contains the sequence identifier.
Line containing the DNA sequence.
Line starting with + (plus sign) indicating the beginning of the quality score line.
Line containing the quality scores corresponding to the DNA sequence.
I have a corrupted fastq.gz file in which the + line is missing from some records. For example, zcat sample.fastq.gz prints:
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
The third read is missing its + line.
The expected output is:
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
I tried:
zcat sample.fastq.gz | awk 'NR%4==0 {print "+"} {print}' | sed 's/^\+$/+/g' > corrected_file.fastq
But it gave me:
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
+
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
Using sed:
$ sed -e '/^@/{n;/^[[:alpha:]]/{n;/^+/!{i\+' -e '}}}' input_file
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
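Before trusting any repaired file, it may help to confirm that the 4-line framing described in the question holds throughout. The following is a minimal sketch, not a full FASTQ validator: check_fastq_framing is a hypothetical helper name, and it assumes single-line sequences and quality strings. It only checks that every first line of a record starts with @, every third line starts with +, and the total line count is a multiple of 4.

```shell
# check_fastq_framing: hypothetical helper, reads FASTQ text on stdin.
# Exits 0 when every record looks like @header / sequence / + / quality.
# Otherwise prints the first problem it finds and exits 1.
check_fastq_framing() {
    awk '
        # Line 1 of each record must be the @ header.
        NR % 4 == 1 && !/^@/  { printf "line %d: expected @ header\n", NR; bad = 1; exit }
        # Line 3 of each record must be the + separator.
        NR % 4 == 3 && !/^\+/ { printf "line %d: expected + separator\n", NR; bad = 1; exit }
        END {
            # A truncated final record leaves a line count that is not a multiple of 4.
            if (!bad && NR % 4 != 0) {
                printf "line count %d is not a multiple of 4\n", NR
                bad = 1
            }
            exit bad + 0
        }
    '
}
```

Run as zcat sample.fastq.gz | check_fastq_framing. On the damaged sample from the question it reports line 11, the quality line of the third read, which sits where the + separator should be.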
You can do this using only Bash builtins:
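zcat sample.fastq.gz | while IFS= read -r LINE; do
    if [[ $LINE =~ ^\+ ]]; then
        # Separator line seen for the current record
        SEEN_PLUS=1
    else
        # Quality lines in this sample start with '#'; adjust the test if yours do not
        if [[ $LINE =~ ^# ]]; then
            if [[ ${SEEN_PLUS:-0} -ne 1 ]]; then
                # Print correction
                printf '+\n'
            fi
        fi
        SEEN_PLUS=0
    fi
    printf -- '%s\n' "$LINE"
done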
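The loop above recognises quality lines by a leading #, and the sed command recognises headers by a leading @. Both hold for this sample, but in general a quality string can begin with other characters, including @. As an alternative sketch, the same repair can be keyed on position within the record instead. add_missing_plus is a hypothetical helper name. It assumes each sequence and quality string occupies a single line and that the only damage is a dropped + line. A damaged record whose quality string itself begins with + would still be misread.

```shell
# add_missing_plus: hypothetical helper, reads FASTQ text on stdin and
# writes the repaired records to stdout.
# Assumptions: single-line sequence and quality strings; the only damage
# is a missing "+" separator line.
add_missing_plus() {
    awk '
        # pos is the number of lines already consumed in the current record:
        # 0 = expect @header, 1 = expect sequence,
        # 2 = expect "+",     3 = expect quality
        pos == 2 && !/^\+/ { print "+"; pos = 3 }   # separator missing: insert it
        { print; pos = (pos + 1) % 4 }              # emit the line and advance
    '
}
```

For example, zcat sample.fastq.gz | add_missing_plus | gzip > corrected_file.fastq.gz writes a recompressed copy without touching the original file.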