How do I count object substrings in a file with a very particular format?

Question

I have a file formatted like this:

{'apple': 4, 'orange': 3, 'peach': 1}
{}
{'apple': 1, 'banana': 1}
{'peach': 1}
{}
{}
{'pear': 3}
...

[10k more lines like this]

I want to create a new text file that stores the total count of each of these fruit objects, like this:

apple:110
banana:200
pineapple:50
...

How can I do this?

My attempt: I tried using Python (skip this if it is confusing):

f = open("fruits.txt","r")
lines = f.readlines()
f.close()
g = open("number_of_fruits.txt","a")

for line in lines:                           #Iterating through every line,
    for character in "{}'":                       #Removing extra characters,
        line = line.replace(character, "")    

    for i in range(0,line.count(":")):            #Using the number of colons as a counter,
        line = line[ [m.start() for m in re.finditer("[a-z]",line)][i] : [m.start() for m in re.finditer("[0-9]",line)][i] + 1 ] #Slice the line like this - line[ith time I detect any letter : ith time I detect any number + 1]
        #And then somehow store that number in temp, slicing however needed for every new fruit
        #Open a new file
        #First look if any of the fruits in my line already exist
        #If they do:
            #Convert that sliced number part of string to integer, add temp to it, and write it back to the file
        #else:
            #Make a newline entry with the object name and the sliced number from line.

To begin with, the number of functions in Python is overwhelming. At this point I am even considering doing it in C, which is already a terrible idea.

Answer 1

Avoid using eval.

If you can be sure the format will always look like the sample above, I would treat each line as JSON.

import json
from collections import Counter
with open('fruits.txt') as f:
    counts = Counter()
    for line in f.readlines():
        counts.update(json.loads(line.replace("'", '"')))

If you want the output in the format defined above:

for fruit, count in counts.items():
    print(f"{fruit}:{count}")
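
The question asks for the totals to go into a new text file rather than to the screen. A minimal sketch of that step, assuming the counts have already been collected into a Counter as above. The helper name write_counts is made up for illustration and is not part of the original answer:

```python
from collections import Counter


def write_counts(counts, path):
    """Write one 'fruit:count' line per entry, sorted by fruit name."""
    # "w" truncates the file, so rerunning the script does not pile up
    # duplicate entries (the attempt in the question opened it in "a" mode).
    with open(path, "w", encoding="utf-8") as out:
        for fruit, count in sorted(counts.items()):
            out.write(f"{fruit}:{count}\n")
```

With the counts from the code above, calling write_counts(counts, "number_of_fruits.txt") would produce lines such as apple:5, using the output file name from the question's attempt.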

Updated answer

Following @DarryIG's literal_eval suggestion in the comments, which makes the JSON approach unnecessary:

from ast import literal_eval
from collections import Counter
with open('fruits.txt') as f:
    counts = Counter()
    for line in f.readlines():
        counts.update(literal_eval(line))
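
literal_eval only accepts Python literals (strings, numbers, dicts, lists and so on), so a line containing a function call is rejected instead of being executed. A small sketch of the same loop that also skips blank lines, which literal_eval would otherwise reject with a SyntaxError. The function name count_fruits is hypothetical, and it takes any iterable of lines:

```python
from ast import literal_eval
from collections import Counter


def count_fruits(lines):
    """Sum the per-line dicts from an iterable of text lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # an empty string is not a valid literal
        # literal_eval rejects function calls and other non-literal expressions.
        record = literal_eval(line)
        if not isinstance(record, dict):
            raise ValueError(f"expected a dict, got {line!r}")
        counts.update(record)
    return counts


sample = ["{'apple': 4, 'orange': 3, 'peach': 1}", "{}", "", "{'apple': 1, 'banana': 1}"]
print(count_fruits(sample))
```

An open file object can be passed directly, since iterating over it yields lines.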

Answer 2

You can use literal_eval from the ast module in Python's standard library to evaluate each line as a dictionary.

from ast import literal_eval
from collections import Counter

with open("input.txt", 'r') as inputFile:
    counts = Counter()
    for line in inputFile:
        a = literal_eval(line)
        counts.update(a)

print(dict(counts))

Output:

{'apple': 5, 'orange': 3, 'banana': 1, 'peach': 2, 'pear': 3}

Answer 3

Using defaultdict and json:

import json
from collections import defaultdict

result = defaultdict(int)
with open('fruits.txt') as f:
    for line in f:
        data = json.loads(line.replace("'", '"'))
        for fruit, num in data.items():
            result[fruit] += num
print(result)

Output:

defaultdict(<class 'int'>, {'apple': 5, 'orange': 3, 'peach': 2, 'banana': 1, 'pear': 3})

EDIT: I recommend using @BenjaminRowell's answer (I upvoted it). I am keeping this one anyway, for the sake of brevity.

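EDIT2 (May 22, 2020): If the file used double quotes instead of single quotes, it would be in ndjson / jsonlines format (there is an interesting discussion of how the two relate). You can process it with the ndjson or jsonlines package, for example: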

import ndjson
from collections import Counter

with open('sample.txt') as f:
    # if using double quotes, you can do:
    #data = ndjson.load(f)

    # because it uses single quotes - read the whole file and replace the quotes
    data = f.read()
    data = ndjson.loads(data.replace("'", '"'))


    counts = Counter()
    for item in data:
        counts.update(item)
print(counts)
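
If rewriting the file is an option, a one-off conversion to double-quoted JSON Lines would avoid the quote replacement on every read. A sketch using only the standard library, assuming each non-blank line is a Python dict literal. The name to_json_lines is a hypothetical helper, not something the ndjson or jsonlines packages provide:

```python
import json
from ast import literal_eval


def to_json_lines(lines):
    """Yield one JSON document per non-blank input line."""
    for line in lines:
        line = line.strip()
        if line:
            # literal_eval reads the single-quoted dict; json.dumps re-emits
            # it with the double quotes that JSON requires.
            yield json.dumps(literal_eval(line))


converted = list(to_json_lines(["{'apple': 4, 'orange': 3}", "{}", "{'pear': 3}"]))
print("\n".join(converted))
```

Unlike a blanket replace("'", '"'), this approach also copes with keys that themselves contain an apostrophe or a double quote.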
