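I'm stuck on a Python script that tries to check whether certain elements in a list fall inside or outside genes. To do this, I'm using the following very basic code: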
chr=[]
pos=[]
te=[]
chrom=[]
start=[]
end=[]
gene=[]
for line2 in infile2:
    if "#" not in line2[0]:
        line2=line2.strip()
        line2=line2.split("\t")
        chr.append(line2[0])
        pos.append(line2[1])
        te.append(line2[2])
for line in infile:
    line=line.strip()
    line=line.split("\t")
    chrom.append(line[0])
    start.append(line[1])
    end.append(line[2])
    gene.append(line[3])
for n in range(len(chrom)):
    for m in range(len(chr)):
        if chrom[n]==chr[m] and int(pos[n]) in range(int(start[m]), int(end[m])):
            print(chr[m], pos[m], te[m], gene[n])
        else:
            print(chr[m], pos[m], te[m], "intergenic")
infile has the following structure, where the first column is the chromosome, followed by the start and end positions of the gene:
X 100075379 100096509 NM_001306206.2 0 + 100075405 100093350 0 15 84,79,170,137,120,138,124,63,60,142,176,293,111,126,807, 0,1136,1860,2901,3495,3729,6243,7649,10417,11124,12343,12789,16948,17848,20323
X 100075379 100096509 NM_001306209.2 0 + 100075405 100093350 0 14 84,79,170,137,120,138,73,63,142,176,293,111,126,807, 0,1136,1860,2901,3495,3729,6294,7649,11124,12343,12789,16948,17848,20323
infile2 looks like this:
#CHR POS INFO STRAND IDS...
X 100007156 variant -
X 10000849 variant +
X 100024284 variant -
X 10003672 variant -
X 100050489 variant +
The output should look like this, where NM_ is the gene name in RefSeq format:
X 100628757 variant NM_000061.3
X 101152133 variant NM_001011657.4
X 100602245 variant intergenic
X 100236510 variant intergenic
X 100244318 variant NM_001162491.2
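The problem is that the output only ever reports "intergenic" variants, although I know that at least 30-40% of the elements in my list fall inside genes. Can anyone help me?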
If I add a row to infile2 that I think should trigger a match:
X 100075380 variant +
this code (note the + 1 at the end of the range, which makes it inclusive) seems to work:
## -----------------------
## build a version of infile
## -----------------------
infile = """
X 100075379 100096509 NM_001306206.2 0 + 100075405 100093350 0 15 84,79,170,137,120,138,124,63,60,142,176,293,111,126,807, 0,1136,1860,2901,3495,3729,6243,7649,10417,11124,12343,12789,16948,17848,20323
X 100075379 100096509 NM_001306209.2 0 + 100075405 100093350 0 14 84,79,170,137,120,138,73,63,142,176,293,111,126,807, 0,1136,1860,2901,3495,3729,6294,7649,11124,12343,12789,16948,17848,20323
"""
infile = [
    [cell.strip() for cell in line.strip().split("\t")[:4]]
    for line in infile.split("\n")
    if line.strip()
]
print("\ninfile:")
for row in infile:
    print(row)
## -----------------------
## -----------------------
## build a version of infile2
## -----------------------
infile2 = """
#CHR POS INFO STRAND IDS...
X 100007156 variant -
X 10000849 variant +
X 100024284 variant -
X 10003672 variant -
X 100050489 variant +
X 100075380 variant +
"""
infile2 = [
    [cell.strip() for cell in line.strip().split("\t")[:3]]
    for line in infile2.split("\n")
    if line.strip()
][1:]
print("\ninfile2:")
for row in infile2:
    print(row)
## -----------------------
print("\nResults:")
for infile_row in infile:
    chrom = infile_row[0]
    start = int(infile_row[1])
    end = int(infile_row[2])
    gene = infile_row[3]
    for infile2_row in infile2:
        chrom2 = infile2_row[0]
        pos = int(infile2_row[1])
        te = infile2_row[2]
        if chrom == chrom2 and pos in range(start, end + 1): # should the range be inclusive?
            print(chrom, pos, te, gene)
        else:
            print(chrom, pos, te, "intergenic")
That gives me:
infile:
['X', '100075379', '100096509', 'NM_001306206.2']
['X', '100075379', '100096509', 'NM_001306209.2']
infile2:
['X', '100007156', 'variant']
['X', '10000849', 'variant']
['X', '100024284', 'variant']
['X', '10003672', 'variant']
['X', '100050489', 'variant']
['X', '100075380', 'variant']
Results:
X 100007156 variant intergenic
X 10000849 variant intergenic
X 100024284 variant intergenic
X 10003672 variant intergenic
X 100050489 variant intergenic
X 100075380 variant NM_001306206.2
X 100007156 variant intergenic
X 10000849 variant intergenic
X 100024284 variant intergenic
X 10003672 variant intergenic
X 100050489 variant intergenic
X 100075380 variant NM_001306209.2
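Two things stand out when comparing this with the loop in the question. First, the question's condition reads pos[n], start[m] and end[m], but n runs over the gene lists and m over the variant lists, so the indices look swapped: it compares the wrong position against the wrong interval (and can raise an IndexError when the two files differ in length). Second, as the results above show, printing inside the inner loop reports every variant once per gene, so a variant is labelled "intergenic" for each gene it does not overlap. Below is a minimal sketch of one way to report each variant exactly once. The function name annotate_variants, the comma-joined label and the inclusive start <= pos <= end test are my own assumptions, not something the question specifies; if the gene file follows the BED convention (0-based start, exclusive end), the comparison would need adjusting.

```python
from collections import defaultdict


def annotate_variants(genes, variants):
    """Return one (chrom, pos, info, label) tuple per variant.

    genes:    iterable of (chrom, start, end, name) rows
    variants: iterable of (chrom, pos, info) rows
    label:    comma-joined names of all overlapping genes, or "intergenic"
    """
    # Group gene intervals by chromosome so each variant is only
    # compared against genes on its own chromosome.
    genes_by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        genes_by_chrom[chrom].append((int(start), int(end), name))

    results = []
    for chrom, pos, info in variants:
        pos = int(pos)
        # Assumption: both interval ends are inclusive.
        hits = [
            name
            for start, end, name in genes_by_chrom.get(chrom, [])
            if start <= pos <= end
        ]
        label = ",".join(hits) if hits else "intergenic"
        results.append((chrom, pos, info, label))
    return results


# Rows taken from the sample data above (first four / three columns).
genes = [
    ("X", "100075379", "100096509", "NM_001306206.2"),
    ("X", "100075379", "100096509", "NM_001306209.2"),
]
variants = [
    ("X", "100007156", "variant"),
    ("X", "100075380", "variant"),
]

for row in annotate_variants(genes, variants):
    print(*row)
```

With the sample rows, the first variant comes out as intergenic and the second is labelled with both transcripts, because the two NM_ entries cover the same interval. For large files the linear scan over every gene on a chromosome may become slow; sorting the intervals or using an interval tree would be the usual next step, but that is beyond this sketch.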