perl从2个数组中提取常见元素(fastq文件中的常见序列)

问题描述 投票:1回答:2

我有2个文件(paired_end),其中包含来自同一生物体的Fastq格式的读取(File_R1.fastq和File_R2.fastq)。我想用bwa和samtools(生物信息学)来深度确定覆盖率,为此,我需要两个具有相同读取次数的文件,同名(file_1:@ XX00341:4450:6341 1:N:0:AACGTTAA + TTGCAATT和file_2:@XX00341:4450:6341 2:N:0:AACGTTAA + TTGCAATT),在两个文件中的顺序相同,问题是两个文件都有空读(每个文件的相应读取,可能是空的)一个但不在其他!!!!),如果我尝试用bwa运行它由于空序列而失败。

所以我正在尝试创建一个perl脚本来提取两个文件中具有相同名称和顺序的NO空读。

这是fastq文件的格式(每次读取有4行:名称(@ XX00341:19:000H27K25:1:11101:4450:6341 1:N ....),序列(GAGGTGCGTGGTTGTCACCCC ...),Q_head(+ )和质量(FFFFFFFFFFFFFFF ....),按此顺序。

file_r1.fastq(表示为= .... 1:N:0:AACGTTAA + TTGCAATT)

@XX00341:4450:6341 1:N:0:AACGTTAA+TTGCAATT
CGCCCATCATGAGGTGCGTGGTTGTCACCCCCGCAAACGCGGAGGTGTAAACAGGCTCACCTGTGGGTTGTTGGTAGTCGTTATTGTGCTTGCGCTGTTCATCTGGATTACTGGATTCGACGCGTTGAGC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:14420:6341 1:N:0:AACGTTAA+TTGCAATT
GCTTTGTCGTCGTCGGTTTTAAAGTGAACCGCTTTACCTGTTTCTGCTTGAACTTGTTCTGCTTGAGGTGCTGCTTCTGGTTTTGTTTCTTCTTTCTGACAACCTACCGCTAGCATTACTGTTGCAGCAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFAAFFFFFFAFFFFAFFFFFFFFFFAFAFFFFFAFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFAFFFFFFFAFFFFFFF/FFFFFFFFFFFF
@XX00341:10259:6342 1:N:0:AACGTTAA+TTGCAATT

+

@XX00341:6685:6342 1:N:0:AACGTTAA+TTGCAATT
CATTGGAAGGCAAGCCAGAACAAGGCAAGAATATTCCAGATGACATCGTGCGCGTTCGCATCGATCGCAACAGCGGTCTGCTGACTCATAAAGTGGATAGCACGTCCATGTTTGAATACTTTGAGAAAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF6FFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFAFFFFFFAFFFFFFFFFFFFF
@XX00341:4051:6343 1:N:0:AACGTTAA+TTGCAATT
TCACCATGATCGGATTTATGAATGGTTTAGTGGACAGCATGATCAAAAATGCGATTGCTTGGCAAACCAGCCATTTGCAGATTCATCAAAGTGCTTATCTCGTTAATCCTGAACTGAAAGACACCATACC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:12307:6344 1:N:0:AACGTTAA+TTGCAATT
TCCAAATTGAGAGTTAATGTGAAAAAAGAATGACTTTTTGCTTGTCATGCAGCGGATTTGTGTGATACCACTAATGACGCAAACATTTTCGTAGAACATTACACAATACTATTTAACGAAAAAAGAACGA
+
FAAFFAFAAFFAFFF/FF/FFAFFFFFFFFFAAFFF=F=AFFFFFFFFFFFFAFFFFFFFFFFAFFFFFFFFFFAFFFFFAAFFFFFFFFFF/FAFFFFAFFFFFFFFFFFFAAAFFAF///FAFFFFFF
@XX00341:24250:6345 1:N:0:AACGTTAA+TTGCAATT
CGAACATAGAGCAAGCTCTGGTAAGTCCCGATGGGAGCTTAAAGACACAAGTAGACCAAAAGCTCGCAGAACATGGCTTGAGCCGTAAAGTCACCGTGGCGTCGCGCAACTTTCTCACCATTCGGCATCT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

file_r2.fastq(表示为= .... 2:N:0:AACGTTAA + TTGCAATT)

@XX00341:4450:6341 2:N:0:AACGTTAA+TTGCAATT
GCGGCTTTGGTAGCGAAGCGCGTCTACCAACCGCAGCCATGAAACAACTGGCGTTTGAAGTGGAAAAAACGGCAGCGGGCAGTATTCCGGTACTGATTGAAGCCATCGAACAAGTGGCGGTGCCTACCGC
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:14420:6341 2:N:0:AACGTTAA+TTGCAATT
GCGGTTCGAGTGGCCAACGTTGAACTTCATGCTCGGTAAAAAAGCAACCATTTAACGTGGTGATGACAATTAAATATAGGAATAAATGAGAAATTCTTTGCATCACATCTAATAATCCGGTTTGTGTCTT
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFAFAFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFF
@XX00341:10259:6342 2:N:0:AACGTTAA+TTGCAATT
TCTTTTTTCGTTAAATAGTATTGTGTAAT
+
FFFFFFFFFFFFFAFFFFFFFFFFFFFFF
@XX00341:6685:6342 2:N:0:AACGTTAA+TTGCAATT
GAATAAATGCTGTCTTCGAGGCTGTTACCAACGTATTCGGTTGGCTCCGTGCCTTTCTCAAAGTATTCAAACATGGACGTGCTATCCACTTTATGAGTCAGCAGACCGCTGTTGCGATCGATGCGAACGC
+
AFFFFFFFFAFFFFAFFFFFFFFFFFFFFFFFFFFFFFF/FF/FFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFF=FF=FFFFFFFAFFFFFFF
@XX00341:4051:6343 2:N:0:AACGTTAA+TTGCAATT
GAAGCTATCATGCCATCGGCGAGAAACCGCTCCGATACGGCTTTCACGCTTTGATGCTTGTCTAACGTTGTCACGATGCTTTGCGAGTCAGGTATGGTGTCTTTCAGTTCAGGATTAACGAGATAAGCAC
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:12307:6344 2:N:0:AACGTTAA+TTGCAATT
GTTCATCTGTCTGTCGTTCTTTTTTCGTTAAATAGTATTGTGTAATGTTCTACGAAAATGTTTGCGTCATTAGTGGTATCACACAAATCCGCTGCATGACAAGCAAAAAGTCATTCTTTTTTCACATTAA
+
AFFFFFFFFFAFFFFFFFFAFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFF=FFFFFFFAFFFFFFFAFFFFFFFFFF=FAFFFFFFFFAFFFAFFAFFFFFFFFA=AFF6AFFAFAFF
@XX00341:24250:6345 2:N:0:AACGTTAA+TTGCAATT
CCATAGCATGGCAATATCAAAATCCGCAACGGCGATCGGCGGCTTAACAGTAATGAGATCGTCTGAAAAGCCCTCCGCCAACGCCATTTTTTCTGGCACGATGGCAATGAGATCGCGCTTTTTCAATAGA
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFF

在这种情况下,file_1.fastq中的一个读取为空(@ XX00341:10259:6342 1:N ....),第二行和第四行为空(序列和质量),但不在file_2.fastq中( @ XX00341:10259:6342 2:N ....);在这个例子中,两个序列(每个文件中的4行)必须省略!

这是我试图完成的代码:

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

my ($fastQ_R1, $fastQ_R2);
      GetOptions (
            'R1|r1=s'   =>\$fastQ_R1,
            'R2|r2=s'   =>\$fastQ_R2,

      );

    sub extract_list {
        my ($file_in) = @_;
        open FILE, '<', $file_in or die "cant open the $file_in\n";

        my (@elements, @list, @seq);
        while(
            defined(my $head    =   <FILE>) &&   # 1 line
            defined(my $seq     =   <FILE>) &&   # 2 line
            defined(my $qhead   =   <FILE>) &&   # 3 line
            defined(my $quality =   <FILE>)      # 4 line
        ){
            if ($seq=~ m/^$/g){
                next;
            }
            else {  push @seq, $head;   }
         }
        close FILE;
        foreach (@seq){
            chomp;
            @elements = split '\s', $_;
                push @list, $elements[0]; # split to eliminate 1:N... (file_1) and 2:N... (file_2)
        }
        return @list;
    }

    my @list_R1 = extract_list ($fastQ_R1);

    my @list_R2 = extract_list  ($fastQ_R2);

到目前为止,我有两个文件中没有空序列(@ list_R1和@ list_R2)的列表,这两个数组的比较将给出两个文件(@common_elements)中存在的空读(有4行!!)从两个数组的比较。

@ list_R2 =

@XX00341:4450:6341   
@XX00341:14420:6341   
@XX00341:6685:6342   
@XX00341:4051:6343   
@XX00341:12307:6344   
@XX00341:24250:6345

@ list_R2 =

@XX00341:4450:6341   
@XX00341:14420:6341   
@XX00341:10259:6342
@XX00341:6685:6342   
@XX00341:4051:6343   
@XX00341:12307:6344   
@XX00341:24250:6345

@ common_elemnts =

@XX00341:4450:6341   
@XX00341:14420:6341   
@XX00341:6685:6342   
@XX00341:4051:6343   
@XX00341:12307:6344   
@XX00341:24250:6345

所以新的数组(@common_elemnts)将用于搜索和提取常见的读取(4行)foreach文件,并将提供两个输出文件:Files_R1_common.fastq(从File_R1.fastq获得)和File_R2_common.fastq(从File_R2.fastq)。

任何建议,将不胜感激!!!!非常感谢

arrays perl bioinformatics sequences fastq
2个回答
2
投票

您最好使用哈希映射来检查列表中的项目是否存在。 构造公共哈希映射:

my %list1_map = map { $_, 1 } @list_R1;
my %common_map = map { $list1_map{$_} ? ($_, 1) : () )} @list_R2;

然后处理您的列表:

for my $item(@list_R1) {
if (defined ($common_map{$item})) {
    # ... process ...       
}

1
投票

在下面的解决方案中,我将FASTQ解析移动到一个单独的类中,该类将从文件中读取4行并返回表示一条记录的id,hdr,seq和qual的hashref。我已经把这堂课作为练习留给你了。这样做可以使下面的逻辑易于理解。可读性很重要。

这段代码只是保存了来自file1的空seqs的ID(即$ hdr =〜/ ^ @(\ S +)/)中的$ 1。然后它读取file2并输出完整记录,并保存空记录的ID。最后,您再次读取file1并通过删除%筛选哈希指示的那些输出完整记录。

该解决方案效率低,因为它读取file1两次。原始问题中没有提到的是FASTQ文件通常包含数百万条记录,因此在%过滤哈希中保存file1中所有未过滤的记录需要大量的RAM,但如果你有大量的RAM,你可以改为适合,但我个人不担心两次阅读文件。

# SAVE IDS OF EMPTY RECORDS FROM FILE1 IN HASH
my %filter;
my $fq = new Fastq($file1);
while (my $rec = nextSeq($fq))
{
    $rec->{seq} or $filter{$rec->{id}} = undef; # not using value
}

# ADD IDS OF EMPTY RECORDS FROM FILE2 TO HASH AND OUTPUT FILTERED_FILE2
open(my $out, '>', "$file2.filtered") or die($!);
$fq = new Fastq($file2);
while (my $rec = nextSeq($fq) )
{
    if ( $rec->{seq} )
    {
        # READ2 FOR THIS PAIR IS NOT EMPTY
        if ( ! exists($filter{$rec->{id}}) )
        {
             # READ1 FOR THIS PAIR IS ALSO NOT EMPTY
             print $out join("\n", $rec->{hdr}, $rec->{seq}, '+', $rec->{qual}), "\n";
        }
    } else
    {
         # READ2 IS EMPTY, ADD TO FILTERED HASH
         $filtered{$rec->{id}} = undef;
    }
}
close($out);

# OUTPUT FILTERED_FILE1
open($out, '>', "$file1.filtered") or die($!);
$fq = new Fastq($file1);
while (my $rec = nextSeq($fq) )
{
    if ( $rec->{seq} and ! exists($filter{$rec->{id}}) )
    {
         print $out join("\n", $rec->{hdr}, $rec->{seq}, '+', $rec->{qual}), "\n";
    }
}  
close($out);   
© www.soinside.com 2019 - 2024. All rights reserved.