我需要对已进行全局比对的 DNA 序列的给定区域执行局部比对,并更新全局 CIGAR 字符串的相应部分。
步骤如下:
1. 根据全局比对的CIGAR字符串,从“query”序列中提取感兴趣的区域。
2. 进行区域的局部对齐并生成局部CIGAR字符串。
3. 从全局 CIGAR 字符串中提取感兴趣区域“之前”和“之后”的操作。
4. 生成新的全局 CIGAR 字符串
我想获得有关 #1 和 #3
的帮助这是一个具体的例子:
Input:
------
Reference sequence: ACGGCATCAGCGATCATCGGCATATCGAC (against which the global alignment was made)
Query sequence: TCATCAAGCGTCGCCCTC
CIGAR string: 1S5M1I3M4D6M2D2M
CIGAR position: 5 (1-based, position of the leftmost mapped nucleotide in the reference)
Genomic region: 7-23 (1-based, inclusive)
以及以上数据的解读:
Reference: ACGGCA|TCA-GCGATCATCGGCAT|ATCGAC
Query: .CA|TCAAGCG----TCGCCC-|-TC
CIGAR*: SMM|MMMIMMMDDDDMMMMMMD|DMM
Position: ↑
注意:基因组区域位于
|
字符之间
因此,在这个例子中,我需要从
TCAAGCGTCGCCC
中提取 Query
,并从 SMM
中提取 DMM
& CIGAR*
。
这是我笨拙的尝试:
import re
import itertools
# Input:
reference = 'ACGGCATCAGCGATCATCGGCATATCGAC'
query = 'TCATCAAGCGTCGCCCTC'
cigar = '1S5M1I3M4D6M2D2M'
region = (7-1, 23)
position = 5-1
# Now trying the extract the region in the query and the outer parts of the CIGAR:
x0, x1 = region # range in reference
y0, y1 = (None, None) # range in query
c0, c1 = ([], []) # cigar operations, before and after the region
x = position # current position in reference
y = 0 # current position in query
for op in itertools.chain.from_iterable(itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar)):
if x < x0:
c0.append(op)
elif x < x1:
if y0 == None:
y0 = y
else:
y1 = y
else:
c1.append(op)
if op == 'M':
x += 1
y += 1
elif op in ['S','I']:
y += 1
elif op == 'D':
x += 1
y1 += 1
print( (''.join(c0), query[y0:y1], ''.join(c1)) )
('SMM', 'TCAAGCGTCGCCCT', 'DMM')
问题是我在查询区域的末尾得到了一个多余的
T
。
只有遇到递增
y
的操作才应该更新y0
或y1
;你需要排除 D
:
for op in itertools.chain.from_iterable(itertools.repeat(m[1], int(m[0])) for m in re.findall(r'(\d+)([MIDS])', cigar)):
if x < x0:
c0.append(op)
elif x < x1:
if op != 'D':
if y0 == None:
y0 = y
else:
y1 = y
else:
c1.append(op)
if op == 'M':
x += 1
y += 1
elif op in ['S','I']:
y += 1
elif op == 'D':
x += 1
y1 += 1
print( (''.join(c0), query[y0:y1], ''.join(c1)) )
('SMM', 'TCAAGCGTCGCCC', 'DMM')