Regex field matching and replacement in a network - Python

Question

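I have a large CSV (1,000,000+ rows) on which I need to perform a regex search and replace. In short, I need to take two columns and find the matches between them, then use the matching row to replace the value in a third field. It essentially matches certain assemblies in a network to their upstream components. Here is a small example: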

OID	Assembly	Upstream	Field1
1	abc123		1
2	def456	abc123	2
3	ghi789	jkl101	3
4	jkl101		4

This is the expected result:

OID	Assembly	Upstream	Field1
1	abc123		1
2	def456	abc123	1
3	ghi789	jkl101	4
4	jkl101		4

As you can see, any row whose Upstream value appears in the Assembly field gets a Field1 value equal to that of its upstream neighbor.
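The rule above can be sketched on the toy table with a plain dictionary lookup. This is only an illustration under stated assumptions: the keys `Assembly`, `Upstream`, and `Field1` mirror the example table, the helper name `propagate_field1` is hypothetical, and each Upstream cell is assumed to hold at most one assembly ID.

```python
# Toy rows mirroring the example table (all values kept as strings, as csv would read them).
rows = [
    {"OID": "1", "Assembly": "abc123", "Upstream": "", "Field1": "1"},
    {"OID": "2", "Assembly": "def456", "Upstream": "abc123", "Field1": "2"},
    {"OID": "3", "Assembly": "ghi789", "Upstream": "jkl101", "Field1": "3"},
    {"OID": "4", "Assembly": "jkl101", "Upstream": "", "Field1": "4"},
]


def propagate_field1(rows):
    """Return copies of rows where Field1 is taken from the upstream assembly, if any."""
    # Map every assembly ID to its own Field1 value.
    field1_by_assembly = {r["Assembly"]: r["Field1"] for r in rows}
    result = []
    for r in rows:
        updated = dict(r)  # copy so the input rows are left untouched
        upstream = r["Upstream"]
        # Only rows whose Upstream value is a known assembly are changed.
        if upstream in field1_by_assembly:
            updated["Field1"] = field1_by_assembly[upstream]
        result.append(updated)
    return result


updated_rows = propagate_field1(rows)
```

Rows 2 and 3 pick up the Field1 values of `abc123` and `jkl101`, while rows 1 and 4 have no upstream value and stay as they are.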

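I have code that works perfectly but is very slow (it writes at about 15 kb/s); it currently uses the regex module in Python. My question is: what is a more efficient way to do this? Pandas is out of the question because of limited memory, as are data formats other than CSV. I have tried dask in the past but never got it working properly, probably because of my (very) restricted IT environment: I have no access to the environment path variables for Python.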

Here is the code:

import csv
import re

#csv files
input_file = 'L:\\Dev_h\\Device Heirarchy\\fulljoin_device_flow2.csv'
output_file = 'L:\\Dev_h\\Device Heirarchy\\output2.csv'

# output fields
output_fields = ['gs_attached_assembly_guid', 'gs_upstream_aa_guid', 'Field1_num','Dev_no', 'gs_guid', 'gs_display_feature_guid', 'field2', 'gs_network_feature_name', 'gs_assembly_guid', 'gs_display_feature_name', 'Field1', 'gs_network_feature_guid', 'OID_']


with open(input_file, 'r', newline='') as in_csv, open(output_file, 'w', newline='') as out_csv:
    reader = csv.DictReader(in_csv)
    writer = csv.DictWriter(out_csv, fieldnames=output_fields)
    writer.writeheader()

    # Build Regex
    patterns = {row['gs_attached_assembly_guid']: row['Field1_num'] for row in reader}
    pattern = re.compile('|'.join(map(re.escape, patterns.keys())))

    # restart loop
    in_csv.seek(0)
    next(reader) # Skip header row

    #for loop allowing pattern matching
    for row in reader:

        # Step 6: Define a function to search the 'gs_upstream_aa_guid' column using the regex pattern
        def search_and_replace(match):
            matched_guids = match.group().split(',')
            replacement_values = []
            for matched_guid in matched_guids:
                if matched_guid in patterns and patterns[matched_guid] != '':
                    replacement_values.append(patterns[matched_guid])
                else:
                    # Return an empty string instead of the gs_attached_assembly_guid
                    replacement_values.append('')

            return ','.join(replacement_values)

        # check for matches in 'gs_upstream_aa_guid' value
        match = pattern.search(row['gs_upstream_aa_guid'])

        #If there is a match, replace the 'Field1' value with the matched row's 'Field1_num' value
        if match:
            row['Field1'] = search_and_replace(match)
        #Otherwise skip
        else:
            pass

        #Write the updated row out to the output CSV
        writer.writerow(row)

print("End")

So the question is: how can I speed this process up?

Answer 1

Instead of building one big regex, you can drop it and replace

match = pattern.search(row['gs_upstream_aa_guid'])

with

match = row['gs_upstream_aa_guid'] in patterns
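A fuller sketch of that suggestion is given below, with some caveats. Once `match` is a boolean, `search_and_replace(match)` can no longer call `match.group()`, so the replacement becomes a direct dictionary lookup. The function name `propagate_upstream_values` is hypothetical, the column names are taken from the question, and the sketch assumes each `gs_upstream_aa_guid` cell holds exactly one GUID (the regex version would also match a GUID embedded in a longer string). It also writes the input's own header rather than the fixed `output_fields` list, which is an assumption about the desired output layout.

```python
import csv


def propagate_upstream_values(input_path, output_path):
    """Copy a CSV, setting Field1 from the upstream assembly's Field1_num where one exists."""
    # First pass: map each assembly GUID to its Field1_num value.
    # The question's code already keeps this dictionary in memory.
    with open(input_path, "r", newline="") as in_csv:
        lookup = {
            row["gs_attached_assembly_guid"]: row["Field1_num"]
            for row in csv.DictReader(in_csv)
        }

    # Second pass: stream the rows and do one hash lookup per row,
    # instead of searching with a million-alternative regex.
    with open(input_path, "r", newline="") as in_csv, \
            open(output_path, "w", newline="") as out_csv:
        reader = csv.DictReader(in_csv)
        # Assumption: keep the input's column order in the output.
        writer = csv.DictWriter(out_csv, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            upstream = row["gs_upstream_aa_guid"]
            # Skip empty upstream cells; replace only on an exact GUID match.
            if upstream and upstream in lookup:
                row["Field1"] = lookup[upstream]
            writer.writerow(row)
```

Calling `propagate_upstream_values(input_file, output_file)` with the two paths defined in the question would take the place of the original `with` block. Dictionary membership is a constant-time operation on average, so the per-row cost no longer grows with the number of GUIDs.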
