使用Python进行文本处理

Question

我需要从 25,000,000 条记录中提取 1,500,000 条记录并对它们进行分组。

要提取的记录的组和 UUID 在单独的文件 (200MB) 中定义，格式如下：

>Cluster 0
0   70nt, >90ec66e4-c038-41f0-a553-c94864cf3958... at +/80.00%
1   88nt, >2d45d336-a0f4-4eca-8577-b950e11bb4cf... *
2   70nt, >6f6ad8f1-0cfb-4e57-8962-366cd749fa3f... at +/82.86%
>Cluster 1
0   74nt, >5f584468-a231-416d-9156-ff68e11ee096... *

这是我解析它的函数：

def clstr_parse(filename):
    clstr = None
    with open(filename) as f:
        for line in f:
            if line.startswith('>'):
                if clstr:
                    yield clstr
                clstr = []
            else:
                uuid = line.split()[2][1:37]
                clstr.append(uuid)
    if clstr:
        yield clstr

我用它来提取包含多个 UUID 的“组”（UUID 列表）：

groups = [grp for grp in clstr_parse('file.clstr') if len(grp) >= 2]

然后，我定义一个

dict

，其中 UUID 作为用于在提取过程中存储记录的键：

records = {uuid: None for grp in groups for uuid in grp}

包含要提取的记录的文件 (30GB) 采用以下格式：

@header1
@header...
92fa0cdf-9e1b-4f83-b6e0-ca35885bfdbd    16  <SEQFILE>   4   60  <CIGAR> *   0   0   <SEQUENCE>  <QUALITIES> NM:i:54 ms:i:298    AS:i:246    nn:i:32 tp:A:P  cm:i:18 s1:i:126    s2:i:0  de:f:0.0632 rl:i:0
2d45d336-a0f4-4eca-8577-b950e11bb4cf    16  <SEQFILE>   13  60  <CIGAR> *   0   0   <SEQUENCE>  <QUALITIES> NM:i:43 ms:i:362    AS:i:314    nn:i:32 tp:A:P  cm:i:24 s1:i:154    s2:i:0  de:f:0.0244 SA:Z:SEQ,2500,-,...,40,56;  rl:i:0
2d45d336-a0f4-4eca-8577-b950e11bb4cf    2064    <SEQFILE>   2500    40  <CIGAR> *   0   0   <SEQUENCE>  <QUALITIES> NM:i:56 ms:i:297    AS:i:241    nn:i:32 tp:A:P  cm:i:9  s1:i:63 s2:i:0  de:f:0.0603 SA:Z:SEQ,13,-,...,60,43;    rl:i:0
6f6ad8f1-0cfb-4e57-8962-366cd749fa3f    0   <SEQFILE>   2608    11  <CIGAR>177S64M1D8M1I11M4D2M1D16M2D15M1D9M4S *   0   0   <SEQUENCE>  <QUALITIES> NM:i:14 ms:i:186    AS:i:182    nn:i:0  tp:A:P  cm:i:5  s1:i:47 s2:i:0  de:f:0.0763 rl:i:0
90ec66e4-c038-41f0-a553-c94864cf3958    16  <SEQFILE>   2501    60  <CIGAR> *   0   0   <SEQUENCE>  <QUALITIES> NM:i:47 ms:i:337    AS:i:289    nn:i:32 tp:A:P  cm:i:20 s1:i:115    s2:i:0  de:f:0.0441 SA:Z:Vent_exo-_barcoded,18,-,195M41D229S,60,45; rl:i:0

这是我解析它的函数：

def sam_parse(filename):
    with open(filename) as f:
        for line in f:
            if line.startswith('@'):
                pass
            else:
                yield line
                for line in f:
                    yield line

我用它来执行提取：

for rec in sam_parse('file.sam'):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]

for grp in groups:
    for uuid in grp:
        print(records[uuid])
    print()

问题是，我预计这个程序需要不到 15 分钟才能完成（使用

awk

进行测试），但我启动它已经 8 个小时了，而且它仍在运行。 Python代码有问题吗？

Answer 1

尝试

for rec in sam_parse('file.sam'):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    flag = flag.split(" ")[0]

    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]

使用Python进行文本处理

问题描述投票：0回答：1

1个回答

最新问题

使用Python进行文本处理

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1