我需要从 25,000,000 条记录中提取 1,500,000 条记录并对它们进行分组。
要提取的记录的组和 UUID 在单独的文件 (200MB) 中定义,格式如下:
>Cluster 0
0 70nt, >90ec66e4-c038-41f0-a553-c94864cf3958... at +/80.00%
1 88nt, >2d45d336-a0f4-4eca-8577-b950e11bb4cf... *
2 70nt, >6f6ad8f1-0cfb-4e57-8962-366cd749fa3f... at +/82.86%
>Cluster 1
0 74nt, >5f584468-a231-416d-9156-ff68e11ee096... *
这是我解析它的函数:
def clstr_parse(filename):
clstr = None
with open(filename) as f:
for line in f:
if line.startswith('>'):
if clstr:
yield clstr
clstr = []
else:
uuid = line.split()[2][1:37]
clstr.append(uuid)
if clstr:
yield clstr
我用它来提取包含多个 UUID 的“组”(UUID 列表):
groups = [grp for grp in clstr_parse('file.clstr') if len(grp) >= 2]
然后,我定义一个
dict
,其中 UUID 作为用于在提取过程中存储记录的键:
records = {uuid: None for grp in groups for uuid in grp}
包含要提取的记录的文件 (30GB) 采用以下格式:
@header1
@header...
92fa0cdf-9e1b-4f83-b6e0-ca35885bfdbd 16 <SEQFILE> 4 60 <CIGAR> * 0 0 <SEQUENCE> <QUALITIES> NM:i:54 ms:i:298 AS:i:246 nn:i:32 tp:A:P cm:i:18 s1:i:126 s2:i:0 de:f:0.0632 rl:i:0
2d45d336-a0f4-4eca-8577-b950e11bb4cf 16 <SEQFILE> 13 60 <CIGAR> * 0 0 <SEQUENCE> <QUALITIES> NM:i:43 ms:i:362 AS:i:314 nn:i:32 tp:A:P cm:i:24 s1:i:154 s2:i:0 de:f:0.0244 SA:Z:SEQ,2500,-,...,40,56; rl:i:0
2d45d336-a0f4-4eca-8577-b950e11bb4cf 2064 <SEQFILE> 2500 40 <CIGAR> * 0 0 <SEQUENCE> <QUALITIES> NM:i:56 ms:i:297 AS:i:241 nn:i:32 tp:A:P cm:i:9 s1:i:63 s2:i:0 de:f:0.0603 SA:Z:SEQ,13,-,...,60,43; rl:i:0
6f6ad8f1-0cfb-4e57-8962-366cd749fa3f 0 <SEQFILE> 2608 11 <CIGAR>177S64M1D8M1I11M4D2M1D16M2D15M1D9M4S * 0 0 <SEQUENCE> <QUALITIES> NM:i:14 ms:i:186 AS:i:182 nn:i:0 tp:A:P cm:i:5 s1:i:47 s2:i:0 de:f:0.0763 rl:i:0
90ec66e4-c038-41f0-a553-c94864cf3958 16 <SEQFILE> 2501 60 <CIGAR> * 0 0 <SEQUENCE> <QUALITIES> NM:i:47 ms:i:337 AS:i:289 nn:i:32 tp:A:P cm:i:20 s1:i:115 s2:i:0 de:f:0.0441 SA:Z:Vent_exo-_barcoded,18,-,195M41D229S,60,45; rl:i:0
这是我解析它的函数:
def sam_parse(filename):
with open(filename) as f:
for line in f:
if line.startswith('@'):
pass
else:
yield line
for line in f:
yield line
我用它来执行提取:
for rec in sam_parse('file.sam'):
(uuid, flag) = rec.split(maxsplit=2)[0:2]
if uuid in records and int(flag) < 2048:
records[uuid] = rec[0:-1]
for grp in groups:
for uuid in grp:
print(records[uuid])
print()
问题是,我预计这个程序需要不到 15 分钟才能完成(使用
awk
进行测试),但我启动它已经 8 个小时了,而且它仍在运行。 Python代码有问题吗?
尝试
for rec in sam_parse('file.sam'):
(uuid, flag) = rec.split(maxsplit=2)[0:2]
flag = flag.split(" ")[0]
if uuid in records and int(flag) < 2048:
records[uuid] = rec[0:-1]