在Perl中查找并合并间隔

问题描述 投票:0回答:2

我从文件中输入的内容如下所示:文件的标签为分隔符,按字母顺序排列样本,数字排序为第2列和第3列中的要素。我想要做的是查找重叠和包含的要素并将它们合并为一个要素。

SampleA 100 500
SampleA 200 600
SampleA 300 400
SampleA 700 800
SampleA 900 1100
SampleA 1200    1500
SampleA 1400    1700
SampleA 1600    1900
SampleB 400 600 
SampleB 700 900 
SampleB 1000    1800    
SampleB 1500    1600
SampleB 1900    2500    
SampleB 2500    2600    
SampleB 3000    3600    
SampleB 3100    3400

例如:前三个SampleA案例将成为:

Sample A 100 600

我现在的问题是,当我迭代我的数据结构时,我可以找到事件但是在尝试合并我的样本时我有点卡住了。我的想法是重做我的循环,直到找到并合并所有内容,但我不知道如何实现这一点。目前,数据存储在2D数组中,如下所示:@storage = [SampleA, start, stop]

my $j = 1;

for (my $i = 0; $i < scalar(@storage); $i++) {

    if ($storage[$i][0] eq $storage[$j][0]) {

        if ($storage[$i][2] > $storage[$j][1] && $storage[$i][2] < $storage[$j][2]) {

            print "Found Overlapp!\n";

        }elsif ( $storage[$i][2] > $storage[$j][1] && $storage[$i][2] > $storage[$j][2]) {

            print "Found Feature in Feature!\n";
        }
    }
    unless ($j == scalar(@storage)){$j++};
}

我的目标是重新运行此循环,直到找不到进一步的匹配,因此所有间隔都不重叠,但我宁愿卡在这里。

perl merge intervals
2个回答
2
投票

我想我会这样做:

#!/usr/bin/env perl

use strict;
use warnings;
use Data::Dumper;

my %ranges;

#iterate line by line. 
while (<>) {
   chomp;
   #split by line
   my ( $name, $start_range, $end_range ) = split;
   #set a variable to see if it's within an existing range. 
   my $in_range = 0;
   #iterate all the existing ones. 
   foreach my $range ( @{ $ranges{$name} } ) {

      #merge if start or end is 'within' this range. 
      if (
         ( $start_range >= $range->{start} and $start_range <= $range->{end} )

         or

         ( $end_range >= $range->{start} and $end_range <= $range->{end} )
        )
      {


         ## then the start or end is within the existing range, so add to it:
         if ( $end_range > $range->{end} ) {
            $range->{end} = $end_range;
         }
         if ( $start_range < $range->{start} ) {
            $range->{start} = $start_range;
         }
         $in_range++;
      }

   }
   #didn't find any matches, so create a new range identity. 
   if ( not $in_range ) {
      push @{ $ranges{$name} }, { start => $start_range, end => $end_range };
   }
}

print Dumper \%ranges;

#iterate by sample
foreach my $sample ( sort keys %ranges ) {
   #iterate by range (sort by lowest start)
   foreach
     my $range ( sort { $a->{start} <=> $b->{start} } @{ $ranges{$sample} } )
   {
      print join "\t", $sample, $range->{start}, $range->{end}, "\n";
   }
}

输出您的数据:

SampleA 100 600 
SampleA 700 800 
SampleA 900 1100    
SampleA 1200    1900    
SampleB 700 900 
SampleB 1000    1800    
SampleB 1900    2600    
SampleB 3000    3600    

这可能不是最有效的算法,因为它会检查所有范围 - 但您可能不需要,因为输入数据是有序的 - 您可以只检查“最新”。


3
投票

由于您输入的排序很好,您可以仅使用固定内存有效地过滤它。

$_ = <> or exit;
my @sample = split;
while (<>) {
    my @newsample = split;
    if ($sample[0] ne $newsample[0]
        || $newsample[2] < $sample[1]
        || $sample[2] < $newsample[1]) {
        # Unmergeable sample
        print "$sample[0]\t$sample[1]\t$sample[2]\n";
        @sample = @newsample;
    }
    elsif ($sample[1] <= $newsample[1] && $newsample[2] <= $sample[2]) {
        # @newsample is included in @sample. Nothing to do
    }
    elsif ($sample[1] <= $newsample[1]) {
        # This @newsample raises the upper limit
        $sample[2] = $newsample[2];
    }
    elsif ($newsample[2] <= $sample[2]) {
        # This @newsample lowers the lower limit.
        $sample[1] = $newsample[1];
    }
    else {
        # This @newsample moves both limits
        @sample = @newsample;
    }
}
# Output the last sample
print "$sample[0]\t$sample[1]\t$sample[2]\n";
© www.soinside.com 2019 - 2024. All rights reserved.