获得每个序列最长的范围

问题描述 投票:0回答:1

我有一个GRanges对象,每个seqnames具有不同的基因组范围(例如染色体)。如何获得仅包含每个序列名称/染色体最长范围的GRanges

例如,如果grGRanges

library(GenomicRanges)

# Make a GRanges object
set.seed(123)
gr <- GRanges(seqnames = rep(c("chr1", "chr2", "chr3"), times=2:4),
              ranges = IRanges(start=sample.int(10000, 9), 
                               width = c(3,5,50,20,10,500,100,500,200)))

# Add a column with the width for clarity:
mcols(gr)$width <- width(gr)

gr
#GRanges object with 9 ranges and 1 metadata column:
#      seqnames    ranges strand |     width
#         <Rle> <IRanges>  <Rle> | <integer>
#  [1]     chr1 2463-2465      * |         3
#  [2]     chr1 2511-2515      * |         5
#  [3]     chr2 8718-8767      * |        50
#  [4]     chr2 2986-3005      * |        20
#  [5]     chr2 1842-1851      * |        10
#  [6]     chr3 9334-9833      * |       500
#  [7]     chr3 3371-3470      * |       100
#  [8]     chr3 4761-5260      * |       500
#  [9]     chr3 6746-6945      * |       200
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths

然后我想获得以下GRanges

#GRanges object with 3 ranges and 1 metadata column:
#    seqnames      ranges strand |     width
#       <Rle>   <IRanges>  <Rle> | <integer>
#  [1]     chr1 2511-2515      * |         5
#  [2]     chr2 8718-8767      * |        50
#  [3]     chr3 9334-9833      * |       500

对于我的应用程序,我可以只获得chr3的第一个最长范围,但我希望能找到一个也可以选择所有领带的解决方案。

r bioconductor genomicranges
1个回答
0
投票

我找到了这个解决方案:

library(GenomicRanges)

# Make the GRanges object
set.seed(123)
gr <- GRanges(seqnames = rep(c("chr1", "chr2", "chr3"), times=2:4),
              ranges = IRanges(start=sample.int(10000, 9), 
                               width = c(3,5,50,20,10,500,100,500,200)))

# Add a column with the width for clarity:
mcols(gr)$width <- width(gr)

# split gr by seqnames
grl <- split(gr, seqnames(gr))

# Get the width of each range organized as an IntegerList
wgr <- width(grl)
# if gr is ordered by seqnames this is equivalent to:
wgr <- splitAsList(width(gr), seqnames(gr))
wgr
#IntegerList of length 3
#[["chr1"]] 3 5
#[["chr2"]] 50 20 10
#[["chr3"]] 500 100 500 200

# Get the ranges with the largest width
# See ?'which.max,NumericList-method' for the global argument
longestRanges <- gr[which.max(wgr, global = TRUE)]

longestRanges
#GRanges object with 3 ranges and 1 metadata column:
#      seqnames      ranges strand |     width
#         <Rle>   <IRanges>  <Rle> | <integer>
#  [1]     chr1 2511-2515      * |         5
#  [2]     chr2 8718-8767      * |        50
#  [3]     chr3 9334-9833      * |       500
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths

请注意,仅当gr中的范围最初由seqnames分组/排序时,此解决方案才有效。如果不是,则需要从grgr <- sort(gr)的排序开始

要获得所有联系,可以使用:

longestRanges <- unlist(grl[which(wgr == max(wgr))])

longestRanges
#GRanges object with 4 ranges and 1 metadata column:
#       seqnames    ranges strand |     width
#          <Rle> <IRanges>  <Rle> | <integer>
#  chr1     chr1 2511-2515      * |         5
#  chr2     chr2 8718-8767      * |        50
#  chr3     chr3 9334-9833      * |       500
#  chr3     chr3 4761-5260      * |       500
#  -------
#  seqinfo: 3 sequences from an unspecified genome; no seqlengths
© www.soinside.com 2019 - 2024. All rights reserved.