按bash中指定的序列长度过滤出FASTA文件

问题描述 投票:1回答:3

[有一个FASTA文件assembly.fasta,其中包含重叠群名称和相应序列:

>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG

我只想获取长度超过30个字母的重叠群,并获得一个新的FASTA文件assembly.filtered.fasta,该文件仅包含具有重叠群名称的长序列,格式为:

>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
awk fasta
3个回答
1
投票

您能不能尝试按照所示的示例进行跟踪,测试和编写。

awk '
/^>/{
  if(sign_val && strLen>=30){
    print sign_val ORS line
  }
  strLen=line=""
  sign_val=$0
  next
}
{
  strLen+=length($0)
  line=(line?line ORS:"")$0
}
END{
  if(sign_val && strLen>=30){
    print sign_val ORS line
  }
}
'  Input_file

说明:添加以上详细说明。

awk '                                ##Starting awk program from here.
/^>/{                                ##Checking condition if line starts from > then do following.
  if(sign_val && strLen>=30){        ##Checking if sign_val is SET and steLen is SET then do following.
    print sign_val ORS line          ##Printing sign_val ORS and line here.
  }
  strLen=line=""                     ##Nullify variables steLen and line here.
  sign_val=$0                        ##Setting sign_val to current line here.
  next                               ##next will skip all further statements from here.
}
{
  strLen+=length($0)                 ##Checking length of line and keep adding it here.
  line=(line?line ORS:"")$0          ##Creating line variable and keep appending it to it with new line.
}
END{                                 ##Starting END block of this program from here.
  if(sign_val && strLen>=30){        ##Checking if sign_val is SET and steLen is SET then do following.
    print sign_val ORS line          ##Printing sign_val ORS and line here.
  }
}
' Input_file                         ##mentioning Input_file name here.

4
投票

使用gnu-awk,您可以使用这个简单的版本:

awk -v RS='>[^\n]+\n' 'length() >= 30 {printf "%s", prt $0} {prt = RT}' file

>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA

0
投票

一种快速实现目标的方法是:

awk -v n=30 '/^>/{ if(l>n) print b; b=$0;l=0;next }
            {l+=length;b=b ORS $0}END{if(l>n) print b }' file

您可能还对BioAwk感兴趣,它是awk的改编版,已调整为处理FASTA文件

bioawk -c fastx -v '(length($seq)>30){print ">" $name ORS $seq}' file.fasta

注: BioAwk基于Brian Kernighan's awk中记录的"The AWK Programming Language", by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X) 。我不确定此版本是否与POSIX兼容。

© www.soinside.com 2019 - 2024. All rights reserved.