How do I insert a value into the missing region of a fastq file?


A fastq file stores each read as a record of four lines:

Line starting with @ contains the sequence identifier.
Line containing the DNA sequence.
Line starting with + (plus sign) indicating the beginning of the quality score line.
Line containing the quality scores corresponding to the DNA sequence.
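
The four-line layout above can be checked mechanically. The following is a minimal sketch, not a full FASTQ validator: check_fastq is a hypothetical helper name, and it assumes that each sequence and each quality string occupies exactly one line (no wrapping). It reads FASTQ text on standard input, prints a message for every line that breaks the layout, and exits non-zero if any were found.

```shell
# check_fastq: hypothetical helper; reads FASTQ text on stdin.
# Exit status 0 if every record follows the 4-line layout, 1 otherwise.
check_fastq() {
    awk '
        # Line 1 of each record must start with "@"
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: expected @ header\n", NR; bad = 1
        }
        # Line 2 is the sequence; remember its length
        NR % 4 == 2 { seqlen = length($0) }
        # Line 3 must start with "+"
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: expected + separator\n", NR; bad = 1
        }
        # Line 4 must have one quality character per base
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1
        }
        END {
            if (NR % 4 != 0) {
                printf "line count %d is not a multiple of 4\n", NR; bad = 1
            }
            exit bad + 0
        }
    '
}
```

It could be run as: zcat sample.fastq.gz | check_fastq. Once a single + line is missing, every later line is shifted by one position, so only the first message is a reliable pointer to the damage.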

I have a corrupted fastq.gz file in which the + line is missing from some records. For example:

zcat sample.fastq.gz

@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA

The third read is missing its + line.

The expected output is:

@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA

I tried:

zcat sample.fastq.gz | awk 'NR%4==0 {print "+"} {print}' | sed 's/^\+$/+/g' > corrected_file.fastq

But it gave me:

@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
+
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
bash awk sed bioinformatics
2 Answers
0 votes

Using sed:

$ sed -e '/^@/{n;/^[[:alpha:]]/{n;/^+/!{i\+' -e '}}}' input_file
@E00592:278:HC7KLCCX2:2:1101:3539:1502 1:N:0:NCATCCTC
NTTCTAAATTGAAGGAAGAACAAGACAAAGAAATACTGGAGACAGAAATTGAATCAAACCATCCTAGAGTGGCTTCTGCTTTACAAGACCA
+
#AAAAJJJJFJJJAJFFAFAJJJJJ-<JJJJJJA-7JFJJJJAJFJFFFJAJJ----JJ-----FJFF-<AF<FFJ-FFFJJJ<FJA<J-F
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:6461:1502 1:N:0:NCATCCTC
NGGGCTGTGAACCAGGCCATCGGGCAAGTGATCTGGCACAGCCAGGACAACAGAGCAGTCTTCCTCTGTGACCACAGGGTTGCCTTTGCAG
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
@E00592:278:HC7KLCCX2:2:1101:5751:1502 1:N:0:NCATCCTC
NAAAATTCATTCTCTAGGTTCATTACTTGAAGGCCCCTTATGATTAGCAAGGACTTGATTGTCTCAGGACACGGTTTCAATAATAAGTAGC
+
#A<<A-AJJJJFFJJJJJ<7JFFFFAJ-7JJJJJJJ7FF-F-<F<JF77AJJ7FAA-F-<<--<JAJJFJJJFJJ<<JAF<JJFJJFAJFA
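
An alternative to matching on the first character of each line is to track the position inside the record, since a quality string may itself begin with @ or + (both are valid quality characters). The following awk version is a sketch under stated assumptions, not a tested replacement for the command above: add_missing_plus is a hypothetical helper name; only the + line is ever missing; sequences and qualities are not wrapped across lines; the separator is either a bare + or + followed by the header text; and reads are longer than one base, so a quality line is never just +.

```shell
# add_missing_plus: hypothetical helper; reads FASTQ text on stdin and
# writes the records to stdout, supplying a "+" line where it is absent.
add_missing_plus() {
    awk '
        # pos counts how many lines of the current record have been seen
        pos == 0 { header = $0; print; pos = 1; next }   # line 1: @ header
        pos == 1 { print; pos = 2; next }                # line 2: sequence
        pos == 2 {                                       # line 3: separator, or quality if "+" was lost
            if ($0 == "+" || $0 == ("+" substr(header, 2))) {
                print; pos = 3; next                     # separator present; quality comes next
            }
            print "+"                                    # separator missing: supply it
            print; pos = 0; next                         # the current line is the quality line
        }
        pos == 3 { print; pos = 0 }                      # line 4: quality
    '
}
```

It could be run as: zcat sample.fastq.gz | add_missing_plus > corrected_file.fastq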

0 votes

You can do this using only Bash builtins:

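zcat sample.fastq.gz | while IFS= read -r LINE; do
    if [[ $LINE =~ ^\+ ]]; then
        # Separator present: remember it for the quality line that follows
        SEEN_PLUS=1
    else
        # Only quality lines beginning with '#' are detected, as in the sample data
        if [[ $LINE =~ ^# ]]; then
            if [[ ${SEEN_PLUS:-0} -ne 1 ]]; then
                # Print correction
                printf '+\n'
            fi
        fi
        SEEN_PLUS=0
    fi
    # Always pass the original line through unchanged
    printf -- '%s\n' "$LINE"
done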