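I am reading the same text file twice with read_csv. The first pass gets a list of keys for the rows where a specific string (MSG) appears in "Col6". That gives me a DataFrame containing only the entries that match on "Col6". I then read the same file a second time (again with read_csv) and print the rows where key1 == key2, with the keys taken from "Col1".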
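I basically have two questions:
1. Can I combine both searches (the two read_csv calls) into one?
2. Even if I keep the two read_csv calls separate, how can I read multiple files? Right now I am reading only one file (firstFile.txt), but I would like to replace the file name with '*.txt' so that the read_csv operation runs on every file in the directory.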
The data file looks like the one below. I want to print all rows with Col1=12345, because Col6 has the value 'This is a test' in one of those rows.
Col1   Col2  Col3  Col4  Col5  Col6
-      -     -     -     -     -
54321  544   657   888   4476  -
12345  345   456   789   1011  'This is a test'
54321  644   857   788   736   -
54321  744   687   898   7436  -
12345  365   856   789   1020  -
12345  385   956   689   1043  -
12345  385   556   889   1055  -
65432  444   676   876   4554  -
-      -     -     -     -     -
54321  544   657   888   776   -
12345  345   456   789   1011  -
54321  587   677   856   7076  -
12345  345   456   789   1011  -
65432  444   676   876   455   -
12345  345   456   789   1011  -
65432  447   776   576   4055  -
-      -     -     -     -     -
65432  434   376   576   4155  -
The script I am using is:
import csv
import pandas as pd
import os
import glob

DL_fields1 = ['Col1', 'Col2']
DL_fields2 = ['Col1', 'Col2','Col3', 'Col4', 'Col5', 'Col6']
MSG = 'This is a test'

iter_csv = pd.read_csv('firstFile.txt', chunksize=1000, usecols=DL_fields1, skiprows=1)
df = pd.concat([chunk[chunk['Special_message'] == MSG] for chunk in iter_csv])

for i, row in df.iterrows():
    key1 = df.loc[i, 'Col1']
    j=0
    for line in pd.read_csv('firstFile.txt', chunksize=1, usecols=DL_fields2, skiprows=1, na_values={'a':'Int64'}):
        key2 = line.loc[j,'Col1']
        j = j + 1
        if (key2 == '-'):
            continue
        elif (int(key1) == int(key2)):
            print (line)
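As far as I understand, you don't need to read the CSV file twice. You essentially want all the rows where MSG occurs in Col6. You can actually do that in a single line: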
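MSG = 'This is a test'
# Read the file once, without chunksize: chunksize returns an iterator, which has no .loc.
# Col6 has to be loaded to filter on it, so use DL_fields2 rather than DL_fields1.
df_all = pd.read_csv('firstFile.txt', usecols=DL_fields2, skiprows=1)
# this gives you all the rows where MSG occurs in Col6
df = df_all.loc[df_all['Col6'] == MSG, :]
# this gives you all the rows where 12345 in Col1
# (compare against '12345' if the '-' rows caused Col1 to be read as text)
df_12345 = df.loc[df['Col1'] == 12345, :]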
You can create multiple subsets of the data this way.
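If the goal is to print every row that shares a Col1 key with a row containing MSG in Col6, one possible approach is to collect those keys first and then select on them with `isin`, all from a single read. The sketch below is only an illustration: the comma-separated sample, the `rows_for_matching_keys` helper, and the choice of `dtype=str` (so the '-' separator rows do not cause mixed types) are assumptions, not details taken from the real file.

```python
import io
import pandas as pd

# Hypothetical comma-separated sample; the real file's delimiter may differ.
SAMPLE = """Col1,Col2,Col3,Col4,Col5,Col6
-,-,-,-,-,-
54321,544,657,888,4476,-
12345,345,456,789,1011,This is a test
54321,644,857,788,736,-
12345,365,856,789,1020,-
65432,444,676,876,4554,-
"""

MSG = 'This is a test'


def rows_for_matching_keys(df, msg):
    """Return every row whose Col1 also appears in a row where Col6 == msg."""
    # Step 1: the Col1 keys of the rows that carry the message.
    keys = df.loc[df['Col6'] == msg, 'Col1'].unique()
    # Step 2: all rows with one of those keys, whatever their Col6 value is.
    return df[df['Col1'].isin(keys)]


# dtype=str keeps every column as text, so keys are compared as strings.
df = pd.read_csv(io.StringIO(SAMPLE), dtype=str)
matches = rows_for_matching_keys(df, MSG)
print(matches)
```

Because the keys are looked up inside the already loaded frame, the file is read only once and no nested loop is needed.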
To answer the second part of your question, you can also loop over all the text files, like this:
import glob
import pandas as pd

txt_files = glob.glob("test/*.txt")
for file in txt_files:
    # read_csv opens the path itself, so no separate open() is needed
    some_df = pd.read_csv(file)
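To keep the data from every file rather than overwriting it on each iteration, the frames can be collected and concatenated. This is a minimal sketch: the `read_all` helper, the `source_file` column, and the two throwaway demo files are hypothetical and exist only to make the example self-contained.

```python
import glob
import os
import tempfile

import pandas as pd


def read_all(pattern):
    """Read every file matching pattern into one DataFrame, tagging rows with their file name."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(path)
        frame['source_file'] = os.path.basename(path)
        frames.append(frame)
    # pd.concat raises on an empty list, so fall back to an empty frame.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Demo with two temporary files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [
        ('a.txt', 'Col1,Col6\n12345,This is a test\n'),
        ('b.txt', 'Col1,Col6\n54321,-\n'),
    ]
    for name, text in demo_files:
        with open(os.path.join(tmp, name), 'w') as fh:
            fh.write(text)
    combined = read_all(os.path.join(tmp, '*.txt'))

print(combined)
```

The extra column makes it possible to tell afterwards which file a matching row came from.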
Edit: here is how to loop over the files and find all rows with Col1=12345 and Col6=MSG:
import glob
import pandas as pd

results_list = []
MSG = 'This is a test'
txt_files = glob.glob("test/*.txt")
for file in txt_files:
    # No chunksize here (it returns an iterator without .loc), and Col6 must be
    # loaded to filter on it, so use DL_fields2 rather than DL_fields1.
    some_df = pd.read_csv(file, usecols=DL_fields2, skiprows=1)
    df = some_df.loc[some_df['Col6'] == MSG, :]
    # results_list is a list of all such dataframes
    results_list.append(df.loc[df['Col1'] == 12345, :])
# All results in one big dataframe
result_df = pd.concat(results_list)
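If the files are too large to load whole, the `chunksize` iterator from the question can still be used, as long as each chunk is filtered individually instead of calling `.loc` on the iterator itself. The following is a sketch under assumptions: the `filter_in_chunks` helper, the comma-separated sample, and `dtype=str` (which is why the key is passed as the string '12345') are illustrative choices, not part of the original script.

```python
import io
import pandas as pd

MSG = 'This is a test'

# Hypothetical comma-separated sample; the real file's delimiter may differ.
SAMPLE = """Col1,Col2,Col6
54321,544,-
12345,345,This is a test
12345,365,-
-,-,-
12345,345,This is a test
"""


def filter_in_chunks(source, msg, key, chunksize=1000):
    """Stream source chunk by chunk, keeping rows where Col6 == msg and Col1 == key."""
    # With chunksize set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(source, chunksize=chunksize, dtype=str)
    kept = [chunk[(chunk['Col6'] == msg) & (chunk['Col1'] == key)] for chunk in reader]
    return pd.concat(kept, ignore_index=True)


# A tiny chunksize is used here only to exercise the chunked path.
result = filter_in_chunks(io.StringIO(SAMPLE), MSG, '12345', chunksize=2)
print(result)
```

Only the matching rows of each chunk are held in memory, so the peak footprint depends on the chunk size rather than the file size.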