I have a large data file in the following format:
1 M * 0.86
2 S * 0.81
3 M * 0.68
4 S * 0.53
5 T . 0.40
6 S . 0.34
7 T . 0.25
8 E . 0.36
9 V . 0.32
10 I . 0.26
11 A . 0.17
12 H . 0.15
13 H . 0.12
14 W . 0.14
15 A . 0.16
16 F . 0.13
17 A . 0.12
18 I . 0.12
19 F . 0.22
20 L . 0.44
21 I * 0.68
22 V * 0.79
23 A * 0.88
24 I * 0.88
25 G * 0.89
26 L * 0.88
27 C * 0.81
28 C * 0.82
29 L * 0.79
30 M * 0.80
31 L * 0.74
32 V * 0.72
33 G * 0.62
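What I am trying to figure out is how to go through each line of the file and, if the line contains an asterisk, find the range of consecutive lines that also contain one. It would also be nice to output the size of the largest range in the file.
So for this example, the desired output would look like:
1-4,21-33 13
Thanks for your help!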
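There are several ways to do this.
One solution is to read the file line by line. I suggest having a look at a good tutorial on how to read files.
Once that is done, you can try the following in Python: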
# Path to your file
filepath = 'test.txt'
with open(filepath) as fp:
    line = fp.readline()
    # 1-based index of the current line
    cnt = 1
    # Collected ranges, each with a start ("deb") and an end index
    output = []
    # Loop until the end of the file (readline() returns '' at EOF)
    while line:
        # Check if the current line has a "*"
        if "*" in line:
            # If so, remember the count: it is the start of a range
            deb = cnt
            # Keep reading while the lines contain "*"
            while "*" in line:
                cnt += 1
                line = fp.readline()
            # The inner loop ends at end of file or at the first line without "*"
            # Add the start and end index to the output
            output.append({"deb": deb, "end": cnt - 1})
        # Read the next line
        cnt += 1
        line = fp.readline()
print(output)
# [{'deb': 1, 'end': 4}, {'deb': 21, 'end': 33}]
Since others are already busy answering, this answer uses a generator to yield the ranges:
def find_ranges(fn):
    with open(fn) as f:
        start = None
        for line_no, line in enumerate(f):
            if start is None:
                if '*' in line:
                    start = line_no + 1  # start of a range (1-based)
            elif '*' not in line:
                yield [start, line_no]  # seen end of range
                start = None
        if start is not None:  # end of file without seeing end of a range
            yield [start, line_no + 1]

# list(...) avoids shadowing the built-in name `range`
ranges = list(find_ranges('test.txt'))
max_range = max(ranges, key=lambda x: x[1] - x[0])  # largest range seen
print(ranges, max_range[1] - max_range[0] + 1)
This prints:
[[1, 4], [21, 33]] 13
Of course, you can format the ranges however you like.
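As one possible way to get the exact output asked for in the question (`1-4,21-33 13`), the ranges can be joined into a single string. This is a minimal sketch, assuming `ranges` is a list of `[start, end]` pairs like the one printed above; `format_ranges` is a hypothetical helper name and not part of the original answer:

```python
def format_ranges(ranges):
    """Return e.g. '1-4,21-33 13' for [[1, 4], [21, 33]] (hypothetical helper)."""
    if not ranges:
        return ""  # assumption: no starred lines means empty output
    # Join each pair as "start-end", separated by commas
    spans = ",".join(f"{start}-{end}" for start, end in ranges)
    # Length of the longest range, counting both endpoints
    longest = max(end - start + 1 for start, end in ranges)
    return f"{spans} {longest}"

print(format_ranges([[1, 4], [21, 33]]))
# 1-4,21-33 13
```

The empty-list check is there because `max()` raises `ValueError` on an empty sequence, which would happen for a file with no starred lines.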
The same algorithm without a generator:
def find_ranges(fn):
    ranges = []
    with open(fn) as f:
        start = None
        for line_no, line in enumerate(f):
            if start is None:
                if '*' in line:
                    start = line_no + 1  # start of a range (1-based)
            elif '*' not in line:
                ranges.append([start, line_no])  # end of a range
                start = None
        if start is not None:  # end of file without seeing end of a range
            ranges.append([start, line_no + 1])
    # Pick the widest range, then return its length alongside all ranges
    max_range = max(ranges, key=lambda x: x[1] - x[0])
    return ranges, max_range[1] - max_range[0] + 1

ranges, max_range = find_ranges('test.txt')
print(ranges, max_range)
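For completeness, a more compact variant of the same idea can be sketched with `itertools.groupby` from the standard library. This swaps in a different technique from the explicit loops above. `star_ranges` is a hypothetical name, and the function is assumed to take any iterable of lines (an open file works too), so it can be tried without a file on disk. The sample below is a shortened, made-up list in the question's format:

```python
from itertools import groupby

def star_ranges(lines):
    """Return [start, end] pairs (1-based) for runs of consecutive lines containing '*'."""
    ranges = []
    numbered = enumerate(lines, start=1)
    # groupby batches consecutive lines that share the same "has a star" flag
    for has_star, group in groupby(numbered, key=lambda pair: '*' in pair[1]):
        if has_star:
            numbers = [line_no for line_no, _ in group]
            ranges.append([numbers[0], numbers[-1]])
    return ranges

sample = ["1 M * 0.86", "2 S * 0.81", "3 T . 0.40", "4 I * 0.68"]
print(star_ranges(sample))
# [[1, 2], [4, 4]]
```

To run it on a file, pass the open file object, for example `with open('test.txt') as f: ranges = star_ranges(f)`.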