SAM 对齐:提取查询序列中的特定区域及其 CIGAR 字符串中的封闭部分

问题描述 投票:0回答:1

我需要对已进行全局比对的 DNA 序列的给定区域执行局部比对,并更新全局 CIGAR 字符串的相应部分。

步骤如下:

1. 根据全局比对的CIGAR字符串,从“query”序列中提取感兴趣的区域。
2. 进行区域的局部对齐并生成局部CIGAR字符串。
3. 从全局 CIGAR 字符串中提取感兴趣区域“之前”和“之后”的操作。
4. 生成新的全局 CIGAR 字符串

我想获得有关 #1#3

的帮助

这是一个具体的例子:

Input:
------
Reference sequence: ACGGCATCAGCGATCATCGGCATATCGAC (against which the global alignment was made)
Query sequence:     TCATCAAGCGTCGCCCTC
CIGAR string:       1S5M1I3M4D6M2D2M
CIGAR position:     5 (1-based, position of the leftmost mapped nucleotide in the reference)
Genomic region:     7-23 (1-based, inclusive)

以及以上数据的解读:

Reference:   ACGGCA|TCA-GCGATCATCGGCAT|ATCGAC
Query:          .CA|TCAAGCG----TCGCCC-|-TC
CIGAR*:         SMM|MMMIMMMDDDDMMMMMMD|DMM
Position:        ↑

注意:基因组区域位于

|
字符

之间

因此,在这个例子中,我需要从

TCAAGCGTCGCCC
中提取
Query
,并从
SMM
中提取
DMM
&
CIGAR*


这是我笨拙的尝试:

import re
import itertools

# Input:
reference = 'ACGGCATCAGCGATCATCGGCATATCGAC'
query     = 'TCATCAAGCGTCGCCCTC'
cigar     = '1S5M1I3M4D6M2D2M'
region    = (7-1, 23)
position  = 5-1

# Now trying the extract the region in the query and the outer parts of the CIGAR:
x0, x1 = region       # range in reference
y0, y1 = (None, None) # range in query
c0, c1 = ([], [])     # cigar operations, before and after the region
x      = position     # current position in reference
y      = 0            # current position in query

for op in itertools.chain.from_iterable(itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar)):

    if x < x0: 
        c0.append(op)
    elif x < x1:
        if y0 == None:
            y0 = y 
        else:
            y1 = y 
    else:
        c1.append(op)

    if op == 'M':
        x += 1
        y += 1
    elif op in ['S','I']:
        y += 1
    elif op == 'D':
        x += 1

y1 += 1

print( (''.join(c0), query[y0:y1], ''.join(c1)) )
('SMM', 'TCAAGCGTCGCCCT', 'DMM')

问题是我在查询区域的末尾得到了一个多余的

T

python bioinformatics text-processing
1个回答
0
投票

只有遇到递增

y
的操作才应该更新
y0
y1
;你需要排除
D
:

for op in itertools.chain.from_iterable(itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar)):

    if x < x0: 
        c0.append(op)
    elif x < x1:
        if op != 'D':
            if y0 == None:
                y0 = y 
            else:
                y1 = y 
    else:
        c1.append(op)

    if op == 'M':
        x += 1
        y += 1
    elif op in ['S','I']:
        y += 1
    elif op == 'D':
        x += 1

y1 += 1

print( (''.join(c0), query[y0:y1], ''.join(c1)) )
('SMM', 'TCAAGCGTCGCCC', 'DMM')
© www.soinside.com 2019 - 2024. All rights reserved.