How to read multiple files with read_csv or lambda

Question

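I am reading the same text file twice with read_csv. The first pass looks for a specific string (MSG) in 'Col6' and collects a list of keys; this gives me a DataFrame containing only the entries that match in 'Col6'. Then I read the same file a second time (again with read_csv) and print the rows where key1 == key2, with the keys taken from 'Col1'.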

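I basically have two questions: 1. Can I combine the two searches (read_csv calls) into one? 2. Even if I keep the two read_csv calls separate, how can I read multiple files? Right now I am reading only one file (firstFile.txt), but I would like to replace the file name with '*.txt' so that read_csv runs on every .txt file in the directory.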

The data file looks like the one below. I want to print all rows with Col1=12345, because one of those rows has the value 'This is a test' in Col6.

Col1  Col2    Col3    Col4    Col5    Col6
-       -       -       -       -       -
54321 544     657     888     4476    -
12345 345     456     789     1011    'This is a test'
54321 644     857     788     736     -
54321 744     687     898     7436    -
12345 365     856     789     1020    -
12345 385     956     689     1043    -
12345 385     556     889     1055    -
65432 444     676     876     4554    -
-     -       -       -       -       -
54321 544     657     888     776     -
12345 345     456     789     1011    -
54321 587     677     856     7076    -
12345 345     456     789     1011    -
65432 444     676     876     455     -
12345 345     456     789     1011    -
65432 447     776     576     4055    -
-     -       -       -       -       -   
65432 434     376     576     4155    -

The script I am using is:

import csv
import pandas as pd
import os
import glob

DL_fields1 = ['Col1', 'Col2']
DL_fields2 = ['Col1', 'Col2','Col3', 'Col4', 'Col5', 'Col6']

MSG = 'This is a test'

iter_csv = pd.read_csv('firstFile.txt', chunksize=1000, usecols=DL_fields1, skiprows=1)
df = pd.concat([chunk[chunk['Special_message'] == MSG] for chunk in iter_csv])

for i, row in df.iterrows():
    key1 = df.loc[i, 'Col1']
    j=0
    for line in pd.read_csv('firstFile.txt', chunksize=1, usecols=DL_fields2, skiprows=1, na_values={'a':'Int64'}):
        key2 = line.loc[j,'Col1']
        j = j + 1
        if (key2 == '-'):
            continue
        elif (int(key1) == int(key2)):
            print (line)

Answer 1

As I understand it, you don't need to read the CSV file twice. You essentially want all the rows where MSG occurs in Col6. You can actually do this in one line:

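MSG = 'This is a test'
# Read without chunksize so the result is a DataFrame (a chunked reader has no .loc);
# DL_fields2 is needed because the filter uses Col6
df_all = pd.read_csv('firstFile.txt', usecols=DL_fields2, skiprows=1)
# this gives you all the rows where MSG occurs in Col6
df = df_all.loc[df_all['Col6'] == MSG, :]
# this gives you all of those rows where 12345 is in Col1
# (compared as text, because Col1 also contains '-' placeholders)
df_12345 = df.loc[df['Col1'].astype(str) == '12345', :]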

You can create multiple subsets of the data this way.
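If the goal is what the question describes (print every row whose Col1 key has at least one row with MSG in Col6), a single read is enough. The following is only a minimal sketch: the comma-separated sample and the helper name rows_for_flagged_keys are assumptions made for illustration, and the real file may need different read_csv options such as sep or skiprows.

```python
import io

import pandas as pd

MSG = "This is a test"

# Hypothetical comma-separated sample standing in for firstFile.txt
SAMPLE = """Col1,Col2,Col3,Col4,Col5,Col6
-,-,-,-,-,-
54321,544,657,888,4476,-
12345,345,456,789,1011,This is a test
54321,644,857,788,736,-
12345,365,856,789,1020,-
65432,444,676,876,4554,-
"""


def rows_for_flagged_keys(source, msg=MSG):
    """Return every row whose Col1 key has at least one row with Col6 == msg."""
    # dtype=str keeps Col1 comparable as text even though some rows hold '-'
    df = pd.read_csv(source, dtype=str)
    # First step: the keys (Col1 values) of the rows that carry the message
    keys = df.loc[df["Col6"] == msg, "Col1"].unique()
    # Second step: all rows that share one of those keys, from the same DataFrame
    return df[df["Col1"].isin(keys)]


result = rows_for_flagged_keys(io.StringIO(SAMPLE))
print(result)
```

Here isin replaces the nested loop over a second read_csv call, so the file is parsed only once.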

To answer the second part of your question, you can also loop over all the text files like this:

import glob
import pandas as pd

txt_files = glob.glob("test/*.txt")
for file in txt_files:
    # read_csv opens the path itself, so a separate open() is not needed
    some_df = pd.read_csv(file)

Edit: this is how to loop over the files and find all rows with Col1=12345 and Col6=MSG:

import glob
import pandas as pd

results_list = []
MSG = 'This is a test'

txt_files = glob.glob("test/*.txt")
for file in txt_files:
    # No chunksize, so read_csv returns a DataFrame; DL_fields2 (from the question) includes Col6
    some_df = pd.read_csv(file, usecols=DL_fields2, skiprows=1)
    df = some_df.loc[some_df['Col6'] == MSG, :]
    # results_list is a list of all such dataframes
    # (Col1 is compared as text because it also contains '-' placeholders)
    results_list.append(df.loc[df['Col1'].astype(str) == '12345', :])

# All results in one big dataframe; pd.concat accepts the whole list, so reduce is not needed
result_df = pd.concat(results_list)
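The same idea can be wrapped in a function that takes a glob pattern and looks up the keys per file instead of hard-coding 12345. This is a sketch under assumptions: the function name collect_matches, the source_file column, and the two throwaway comma-separated demo files are all hypothetical, and real files may need other read_csv options.

```python
import glob
import os
import tempfile

import pandas as pd

MSG = "This is a test"


def collect_matches(pattern, msg=MSG):
    """For each file matching pattern, keep rows whose Col1 key has a row with Col6 == msg."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        # dtype=str keeps Col1 comparable as text even if it contains '-'
        df = pd.read_csv(path, dtype=str)
        keys = df.loc[df["Col6"] == msg, "Col1"].unique()
        matched = df[df["Col1"].isin(keys)].copy()
        if matched.empty:
            continue  # nothing flagged in this file
        # Remember which file each row came from (hypothetical extra column)
        matched["source_file"] = os.path.basename(path)
        frames.append(matched)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Demo with two throwaway files (made-up contents)
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w") as fh:
        fh.write("Col1,Col2,Col6\n12345,345,This is a test\n12345,365,-\n54321,544,-\n")
    with open(os.path.join(tmp, "b.txt"), "w") as fh:
        fh.write("Col1,Col2,Col6\n65432,444,-\n54321,587,-\n")
    result_df = collect_matches(os.path.join(tmp, "*.txt"))

print(result_df)
```

Files with no flagged rows are skipped, and an empty DataFrame is returned when nothing matches at all, which avoids the error pd.concat raises on an empty list.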
