Script using sed and grep produces unexpected output

Question (votes: 1, answers: 6)

I have a "source.fasta" file that contains information in the following format:

>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG
>TRINITY_DN83_c0_g1_i1 len=371 path=[1:0-173 152:174-370] [-1, 1, 152, -2]
GTTGTAAACTTGTATACAATTGGGTTCATTAAAGTGTGCACATTATTTCATAGTTGATTTGATTATTCCGAGTGACCTATTTCGTCACTCGATGTTTAAAGAAATTGCTAGTGTGACCCCAATTGCGTCAGACCAAAGATTGAATCTAGACATTAATTTCCTTTTGTATTTGTATCGAGTAAGTTTACAGTCGTAAATAAAGAATCTGCCTTGAACAAACCTTATTCCTTTTTATTCTAAAAGAGGCCTTTGCGTAGTAGTAACATAGTACAAATTGGTTTATTTAACGATTTATAAACGATATCCTTCCTACAGTCGGGTGAAAAGAAAGTATTCGAAATTAGAATGGTTCCTCATATTACACGTTGCTG
>TRINITY_DN83_c0_g1_i2 len=218 path=[1:0-173 741:174-217] [-1, 1, 741, -2]
GTTGTAAACTTGTATACAATTGGGTTCATTAAAGTGTGCACATTATTTCATAGTTGATTTGATTATTCCGAGTGACCTATTTCGTCACTCGATGTTTAAAGAAATTGCTAGTGTGACCCCAATTGCGTCAGACCAAAGATTGAATCTAGACATTAATTTCCTTTTGTATTTGTACCGAGTAAGTTTCCAGTCGTAAATAAAGAATCTGCCAGATCGGA
>TRINITY_DN99_c0_g1_i1 len=326 path=[1:0-242 221:243-243 652:244-267 246:268-325] [-1, 1, 221, 652, 246, -2]
ATCGGTACTATCATGTCATATATCTAGAAATAATACCTACGAATGTTATAAGAATTTCATAACATGATATAACGATCATACATCATGGCCTTTCGAAGAAAATGGCGCATTTACGTTTAATAATTCCGCGAAAGTCAAGGCAAATACAGACCTAATGCGAAATTGAAAAGAAAATCCGAATATCAGAAACAGAACCCAGAACCAATATGCTCAGCAGTTGCTTTGTAGCCAATAAACTCAACTAGAAATTGCTTATCTTTTATGTAACGCCATAAAACGTAATACCGATAACAGACTAAGCACACATATGTAAATTACCTGCTAAA
>TRINITY_DN90_c0_g1_i1 len=1240 path=[1970:0-527 753:528-1239] [-1, 1970, 753, -2]
GTCGATACTAGACAACGAATAATTGTGTCTATTTTTAAAAATAATTCCTTTTGTAAGCAGATTTTTTTTTTCATGCATGTTTCGAGTAAATTGGATTACGCATTCCACGTAACATCGTAAATGTAACCACATTGTTGTAACATACGGTATTTTTTCTGACAACGGACTCGATTGTAAGCAACTTTGTAACATTATAATCCTATGAGTATGACATTCTTAATAATAGCAACAGGGATAAAAATAAAACTACATTGTTTCATTCAACTCGTAAGTGTTTATTTAAAATTATTATTAAACACTATTGTAATAAAGTTTATATTCCTTTGTCAGTGGTAGACACATAAACAGTTTTCGAGTTCACTGTCG
>TRINITY_DN84_c0_g1_i1 len=301 path=[1:0-220 358:221-300] [-1, 1, 358, -2]
ACTATTATGTAGTACCTACATTAGAAACAACTGACCCAAGACAGGAGAAGTCATTGGATGATTTTCCCCATTAAAAAAAGACAACCTTTTAAGTAAGCATACTCCAAATTAAGGTTTAATTAGCTAAGTGAGCGCGAAAAATGATCAAATATACCGACGTCCATTTGGGGCCTATCCTTTTTAGTGTTCCTAATTGAAATCCTCACGTATACAGCTAGTCACTTTTAAATCTTATAAACATGTGATCCGTCTGCTCATTTGGACGTTACTGCCCAAAGTTGGTACATGTTTCGTACTCACG
>TRINITY_DN84_c0_g1_i2 len=301 path=[1:0-220 199:221-300] [-1, 1, 199, -2]
ACTATTATGTAGTACCTACATTAGAAACAACTGACCCAAGACAGGAGAAGTCATTGGATGATTTTCCCCATTAAAAAAAGACAACCTTTTAAGTAAGCATACTCCAAATTAAGGTTTAATTAGCTAAGTGAGCGCGAAAAATGATCAAATATACCGACGTCCATTTGGGGCCTATCCTTTTTAGTGTTCCTAATTGAAATCCTCACGTATACAGCTAGTCAGCTAACCAAAGATAAGTGTCTTGGCTTGGTATCTACAGATCTCTTTTCGTAATTTCGTGAGTACGAAACATGTACCAACT
>TRINITY_DN72_c0_g1_i1 len=434 path=[412:0-247 847:248-271 661:272-433] [-1, 412, 847, 661, -2]
GTTAATTTAGTGGGAAGTATGTGTTAAAATTAGTAAATTAGGTGTTGGTGTGTTTTTAATATGAATCCGGAAGTGTTTTGTTAGGTTACAAGGGTACGGAATTGTAATAATAGAAATCGGTATCCTTGAGACCAATGTTATCGCATTCGATGCAAGAATAGATTGGGAAATAGTCCGGTTATCAATTACTTAAAGATTTCTATCTTGAAAACTATTTCTAATTGGTAAAAAAACTTATTTAGAATCACCCATAGTTGGAAGTTTAAGATTTGAGACATCTTAAATTTTTGGTAGGTAATTTTAAGATTCTATCGTAGTTAGTACCTTTCGTTCTTCTTATTTTATTTGTAAAATATATTACATTTAGTACGAGTATTGTATTTCCAATATTCAGTCTAATTAGAATTGCAAAATTACTGAACACTCAATCATAA
>TRINITY_DN75_c0_g1_i1 len=478 path=[456:0-477] [-1, 456, -2]
CGAGCACATCAGGCCAGGGTTCCCCAAGTGCTCGAGTTTCGTAACCAAACAACCATCTTCTGGTCCGACCACCAGTCACATGATCAGCTGTGGCGCTCAGTATACGAGCACAGATTGCAACAGCCACCAAATGAGAGAGGAAAGTCATCCACATTGCCATGAAATCTGCGAAAGAGCGTAAATTGCGAGTAGCATGACCGCAGGTACGGCGCAGTAGCTGGAGTTGGCAGCGGCTAGGGGTGCCAGGAGGAGTGCTCCAAGGGTCCATCGTGCTCCACATGCCTCCCCGCCGCTGAACGCGCTCAGAGCCTTGCTCATCTTGCTACGCTCGCTCCGTTCAGTCATCTTCGTGTCTCATCGTCGCAGCGCGTAGTATTTACG

There are nearly 400,000 sequences in this file.

I have another file, ids.txt, in the following format:

>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN8506_c0_g1_i1
>TRINITY_DN12276_c0_g2_i1
>TRINITY_DN15434_c5_g3_i1
>TRINITY_DN9323_c8_g3_i5
>TRINITY_DN11957_c1_g7_i1
>TRINITY_DN15373_c1_g1_i1
>TRINITY_DN22913_c0_g1_i1
>TRINITY_DN13029_c4_g5_i1

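This file contains 100 sequence IDs. When I match these IDs against the source file, I want output that gives each matching id together with its whole sequence.

For example, for the id: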

>TRINITY_DN80_c0_g1_i1

I want my output to be:

>TRINITY_DN80_c0_g1_i1
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG

I want all hundred sequences in this format. I used this code:

while read p; do
echo ''$p >> out.fasta
grep -A 400000 -w $p source.fasta | sed -n -e '1,/>/ {/>/ !{'p''}} >> out.fasta
done < ids.txt

But my output is different: only the last id has a sequence, and the rest have no sequence associated with them:

>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN8506_c0_g1_i1
>TRINITY_DN12276_c0_g2_i1
....
>TRINITY_DN10309_c6_g3_i1
>TRINITY_DN6990_c0_g1_i1
TTTTTTTTTTTTTGTGGAAAAACATTGATTTTATTGAATTGTAAACTTAAAATTAGATTGGCTGCACATCTTAGATTTTGTTGAAAGCAGCAATATCAACAGACTGGACGAAGTCTTCGAATTCCTGGATTTTTTCAGTCAAGAGATCAACAGACACTTTGTCGTCTTCAATGACACACATGATCTGCAGTTTGTTGATACCATATCCAACAGGTACAAGTTTGGAAGCTCCCCAGAGGAGACCATCCATTTCGATGGTGCGGACCTGGTTTTCCATTTCTTTCATGTCTGTTTCATCATCCCATGGCTTGACGTCAAGGATTATAGATGATTTAGCAATGAGAGCAGGTTTCTTCGATTTTTTGTCAGCATAAGCTTTCAGACGTTCTTCACGAATTCTGGCGGCCTCTGCATCCTCTTCCTCGTCGCCAGATCCGAATAGGTCGACGTCATCATCGTCGTCATCCTTAGCAGCGGGTGCAGGTGCTGTGGTGGTCTTTCCGCCAGCGGTCAGAGGGCTAGCTCCAGCCGCCCAGGATTTGCGCTCCTCGGCATTGTAGGAGGCAATCTGGTTGTACCACCGGAGAGCGTGGGGCAAGCTTGCGCTCGGGGCCTTGCCGACTTGTTGGAACACTTGGAAATCGGCTTGAGTTGGTGTGTAACCTGACACATAACTCTTATCAGCTAAGAAATTGTTAAGCTCATTAAGGCCTTGTGCGGTTTTAACGTCTCCTACTGCCATTTTTATTTAAAAAAGTAGTTTTTTTCGAGTAATAGCCACACGCCCCGGCACAATGTGAGCAAGAAGGAATGAAAAAGAAATCTGACATTGACATTGCCATGAAATTGACTTTCAAAGAACGAATGAATTGAACTAATTTGAACGG

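I only get the desired output for the 100th id from my ids.txt. Can someone help me figure out what is wrong with my script? I would like to get all 100 sequences when I run it. Thanks.

I have added Google Drive links to the files I am using: ids.txt

Source.fasta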

bash awk sed grep fasta
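6 Answers
2
votes

Looping repeatedly over a large file is inefficient; you really want to avoid running grep (or sed, or Awk) more than once if you can help it. In general, sed and Awk let you specify actions for individual lines in a file, and then you run the script over the file just once.

For this particular problem, the standard Awk idiom with NR==FNR comes in handy. It is a mechanism that lets you read a number of keys into memory (NR==FNR means you are processing the first input file, because the overall input line number equals the line number within the current file) and then check whether they are present in the subsequent input files.

Recall that Awk reads one line at a time and executes every action whose condition matches. The condition is a simple boolean, and the action is a set of Awk commands inside a pair of braces.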

awk 'NR == FNR { s[$0]; next }
    # If we fall through to here, we have finished processing the first file.
    # If we see a wedge and p is 1, reset it -- this is a new sequence
    /^>/ && p { p = 0 }
    # If the prefix of this line is in s, we have found a sequence we want.
    ($1$2 in s) || ($1 in s) || ((substr($1, 1, 1) " " substr($1, 2)) in s) {
        if ($1 ~ /^>./) { print $1 } else { print $1 $2 }; p = 1; next }
    # If p is true, we want to print this line
    p' ids.txt source.fasta >out.fasta

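So, while we are reading ids.txt, the condition NR==FNR is true, and we simply store each line as a key in the array s. The next causes the rest of the Awk script to be skipped for this line.

On subsequent reads, when NR!=FNR, we use the variable p to control what to print. When we see a new sequence header, we reset p to 0 (if it was 1 from a previous iteration). Then we check whether the header is in s, and if it is, we print the identifier and set p to 1. The final line simply prints the current line if p is neither empty nor zero. (An empty action is shorthand for the action { print }.)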
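For readers who find the flag logic easier to follow outside Awk, here is a rough Python sketch of the same single-pass idea. The function name `filter_fasta` and its arguments are made up for illustration; it assumes the IDs are stored with their leading `>` and that the ID is the first whitespace-separated field of each header.

```python
def filter_fasta(wanted_ids, fasta_lines):
    """Yield the lines of every record whose header ID is in wanted_ids.

    wanted_ids:  a set of IDs including the leading '>' (as in ids.txt).
    fasta_lines: any iterable of lines, for example an open file.
    """
    printing = False  # plays the role of the Awk variable p
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts here: decide whether we want it.
            header = line.split()[0]
            printing = header in wanted_ids
            if printing:
                yield header  # drop len=..., path=... like print $1
        elif printing:
            yield line  # a sequence line belonging to a wanted record


if __name__ == "__main__":
    demo = [">A len=3\n", "ACG\n", ">B len=2\n", "GG\n"]
    for out in filter_fasta({">A"}, demo):
        print(out)
```

Because the flag stays set until the next header, this also copes with sequences that are wrapped over several lines.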

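The somewhat complex condition for checking whether $1 is in s may be overly complicated; I put in some normalization so that a space between the > and the sequence identifier is tolerated, whether or not ids.txt has one there too. If your files are consistently formatted, this can probably be simplified.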
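The normalization can also be done up front, once per line, instead of inside the match condition. A minimal Python sketch, assuming a hypothetical helper called `normalize_header` (not part of the Awk script) that is applied to both the lines of ids.txt and the FASTA headers before they are compared:

```python
def normalize_header(line):
    """Return '>' followed directly by the ID, ignoring any later fields.

    Accepts both '>ID ...' and '> ID ...' and strips the line ending.
    """
    body = line.strip()
    if not body.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    fields = body[1:].split()
    # An empty header (just '>') is returned unchanged.
    return ">" + fields[0] if fields else ">"


if __name__ == "__main__":
    print(normalize_header("> TRINITY_DN80_c0_g1_i1 len=723\n"))
```

With both sides normalized this way, the lookup reduces to a plain set membership test.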


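1
vote

With GNU grep and sed only:

grep --no-group-separator -A 1 -w -F -f ids.txt source.fasta | sed 's/ .*//'

The --no-group-separator option stops GNU grep from printing a -- line between matches that are not adjacent in the file.

See: man grep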
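As a rough illustration of what the pipeline does, here is a Python sketch of the "matching header plus exactly one following line" approach. The name `header_and_next_line` is hypothetical, and like `-A 1` it assumes each sequence sits on a single line right after its header.

```python
def header_and_next_line(wanted_ids, fasta_lines):
    """Yield (id, sequence) for each wanted header, taking one line after it."""
    lines = iter(fasta_lines)
    for line in lines:
        if not line.startswith(">"):
            continue
        header = line.split()[0]  # same effect as sed 's/ .*//' on the header
        if header in wanted_ids:
            # Like grep -A 1: consume exactly one following line.
            sequence = next(lines, "").rstrip("\n")
            yield header, sequence


if __name__ == "__main__":
    demo = [">A len=3\n", "ACG\n", ">B\n", "GG\n"]
    for seq_id, seq in header_and_next_line({">B"}, demo):
        print(seq_id)
        print(seq)
```

If the sequences in your file are wrapped over several lines, this sketch (like the grep command) would only return the first line of each.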


1
vote
$ awk 'NR==FNR{a[$1];next} $1 in a{c=2} c&&c--' ids.txt source.fasta
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
CGTGGATAACACATAAGTCACTGTAATTTAAAAACTGTAGGACTTAGATCTCCTTTCTATATTTTTCTGATAACATATGGAACCCTGCCGATCATCCGATTTGTAATATACTTAACTGCTGGATAACTAGCCAAAAGTCATCAGGTTATTATATTCAATAAAATGTAACTTGCCGTAAGTAACAGAGGTCATATGTTCCTGTTCGTCACTCTGTAGTTACAAATTATGACACGTGTGCGCTG

The above was run against the source.fasta you posted and this ids.txt:

$ cat ids.txt
>TRINITY_DN14840_c10_g1_i1
>TRINITY_DN80_c0_g1_i1
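The `c=2` / `c&&c--` part of that one-liner is a countdown: a match sets the counter to 2, and every line is printed while the counter is non-zero, decrementing it each time. A small Python sketch of the same idea, with a hypothetical function name and a `count` parameter added for illustration:

```python
def lines_after_match(wanted_ids, fasta_lines, count=2):
    """Return each matching line plus the lines after it, count lines in total."""
    remaining = 0  # plays the role of the Awk variable c
    selected = []
    for line in fasta_lines:
        fields = line.split()
        if fields and fields[0] in wanted_ids:
            remaining = count  # like: $1 in a { c = 2 }
        if remaining > 0:  # like: c && c--
            selected.append(line.rstrip("\n"))
            remaining -= 1
    return selected


if __name__ == "__main__":
    demo = [">A len=3\n", "ACG\n", ">B\n", "GG\n"]
    print("\n".join(lines_after_match({">A"}, demo)))
```

Note that, as in the Awk version, the full header line is printed, including the len= and path= fields.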

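0
votes

First, combine all the ids into a single expression, separated by |, like this:

cat ids.txt | tr '\n' '|' | awk '{ print "\"" $0 "\"" }'

Remove the trailing | from the expression.

Now you can grep with the output from the previous command, like this:

grep -E ">TRINITY_DN14840_c10_g1_i1|>TRINITY_DN8506_c0_g1_i1|>TRINITY_DN12276_c0_g2_i1|>TRINITY_DN15434_c5_g3_i1|>TRINITY_DN9323_c8_g3_i5|>TRINITY_DN11957_c1_g7_i1|>TRINITY_DN15373_c1_g1_i1|>TRINITY_DN22913_c0_g1_i1|>TRINITY_DN13029_c4_g5_i1" source.fasta

This prints only the matching lines.

Edited following tripleee's comment:

Use the following to print the output correctly, assuming each ID and its sequence are on separate lines: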

tr '\n' '|' <ids.txt | sed 's/|$//' | grep -A 1 -E -f - source.fasta
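The alternation can also be built programmatically. Here is a minimal Python sketch using the standard `re` module; the function name `build_id_pattern` is made up, and the anchoring (start of line, ID followed by whitespace or end of line) is an extra assumption added here so that a short ID does not match as a prefix of a longer one.

```python
import re


def build_id_pattern(ids):
    """Compile one regex that matches any of the given IDs at line start."""
    # re.escape guards against IDs containing regex metacharacters.
    alternatives = "|".join(re.escape(seq_id) for seq_id in ids if seq_id)
    # (?!\S) requires whitespace or end of line right after the ID.
    return re.compile("^(?:" + alternatives + r")(?!\S)")


if __name__ == "__main__":
    pattern = build_id_pattern([">TRINITY_DN80_c0_g1_i1"])
    header = ">TRINITY_DN80_c0_g1_i1 len=723"
    print(bool(pattern.match(header)))
```

The sketch assumes the list of IDs is not empty; an empty alternation would match every line.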

0
votes

This might work for you (GNU sed):

sed 's#.*#/^&/{s/ .*//;N;p}#' idFile | sed -nf - fastafile

This turns idFile into a sed script and runs it against fastafile.
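To make it easier to see what the first sed command emits, here is a small Python sketch that performs the same text substitution on a list of IDs. The function name `build_sed_script` is hypothetical; it is only meant to show the shape of the generated script, not to replace the pipeline.

```python
def build_sed_script(ids):
    """Return the sed script text: one '/^ID/{...}' command per ID."""
    # For each ID: on a header starting with it, strip everything after
    # the first space, append the next line (N) and print both (p).
    return "\n".join("/^%s/{s/ .*//;N;p}" % seq_id for seq_id in ids)


if __name__ == "__main__":
    print(build_sed_script([">TRINITY_DN80_c0_g1_i1", ">TRINITY_DN83_c0_g1_i2"]))
```

Since each generated address is an unanchored-at-the-end regex, an ID such as >TRINITY_DN8_c0_g1_i1 would also select a header like >TRINITY_DN8_c0_g1_i10; whether that matters depends on your IDs.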


0
votes

The best way is to use Python or Perl. I was able to write a script in Python to extract the sequences by ID, as follows.

# Script to extract sequences from a source file based on IDs in another file.
# The source is a FASTA file with a header line followed by the sequence.
# The IDs file contains one ID per line.
# Both files should mark an ID with the character '>' at the beginning.

def main():
    # Ask the user for the IDs file, the FASTA file and the output file.
    ids_name = input('ids file: ')
    fasta_name = input('fasta file: ')
    output_name = input('enter the output filename: ')

    # Read the IDs into a set so that each lookup takes constant time.
    with open(ids_name) as ids_file:
        wanted = {line.split()[0] for line in ids_file if line.strip()}

    # Read the FASTA file line by line, so it never has to fit in memory.
    # The output file is opened with 'w', which overwrites an existing file.
    with open(fasta_name) as fasta_file, open(output_name, 'w') as output_file:
        keep = False
        for line in fasta_file:
            if line.startswith('>'):
                # Compare only the first field of the header; the rest
                # (len=..., path=...) is not part of the ID.
                header = line.split()[0]
                keep = header in wanted
                if keep:
                    output_file.write(header + '\n')
            elif keep:
                # Sequence line that belongs to a wanted ID.
                output_file.write(line)

main()

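The code is not perfect, but it works for any number of IDs. I tested it on my samples, one of which contained 5000 IDs, and the program worked fine. If you can improve the code, please do; I am relatively new to programming, so the code is a bit rough.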
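One possible improvement, sketched here under assumptions rather than tested on the real files: index the FASTA file by ID once, then look the IDs up in the order they appear in ids.txt and report any that were not found. The function names `index_fasta` and `extract_in_id_order` are hypothetical, and the index keeps every sequence in memory, which may or may not be acceptable for a file with 400,000 sequences.

```python
def index_fasta(fasta_lines):
    """Map each header ID to its sequence (wrapped sequences are joined)."""
    parts_by_id = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            current = line.split()[0]  # ID is the first field of the header
            parts_by_id[current] = []
        elif current is not None and line:
            parts_by_id[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in parts_by_id.items()}


def extract_in_id_order(ids, index):
    """Return (found, missing): found keeps the order of the ids input."""
    found, missing = [], []
    for raw in ids:
        seq_id = raw.strip()
        if not seq_id:
            continue  # skip blank lines in the IDs file
        if seq_id in index:
            found.append((seq_id, index[seq_id]))
        else:
            missing.append(seq_id)
    return found, missing


if __name__ == "__main__":
    fasta = [">A len=3\n", "ACG\n", ">B\n", "GG\n"]
    found, missing = extract_in_id_order([">B\n", ">Z\n"], index_fasta(fasta))
    print(found, missing)
```

The list of missing IDs is handy for spotting IDs that silently fail to match, for example because of stray whitespace in one of the files.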
