Reading the latest file for each month and year from a directory in PySpark

Question

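My directory contains multiple files. The file names are similar to those shown in Figure 1.

I want to read only the latest file of each month from the directory into a PySpark DataFrame. The files I expect to be read are shown in Figure 2.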

Answer 1

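import glob
import os

path = '/your_path/'
form = 'csv'

# Collect the base names of all matching files without changing the working directory.
pattern = os.path.join(path, '*.{}'.format(form))
files_list = sorted(os.path.basename(f) for f in glob.glob(pattern))

# Assumes every name has a 4-character prefix, an 8-digit YYYYMMDD date,
# and a 4-character suffix such as ".csv".
prefix = files_list[0][:4]
suffix = files_list[0][-4:]

# Maps each year-month (YYYYMM) to the latest day (DD) seen so far.
dic = {}

for name in files_list:
    ym = name[4:10]
    d = name[10:12]
    # Zero-padded day strings compare correctly as plain text.
    if ym not in dic or d > dic[ym]:
        dic[ym] = d

# Rebuild the full path of the latest file for each month.
files_to_open = [os.path.join(path, prefix + ym + d + suffix) for ym, d in dic.items()]

# `spark` is an existing SparkSession; load() accepts a list of paths.
df = spark.read.format(form).option("header", "true").load(files_to_open)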
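
The answer above relies on fixed character positions in the file name. If the prefix length varies, a regular expression may be more robust. The following is only a sketch: the helper `latest_per_month` and the sample names such as `abc_20200105.csv` are hypothetical, and it assumes each name ends with an 8-digit YYYYMMDD date directly before the extension.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed layout: <anything><YYYYMM><DD>.<extension>
DATE_RE = re.compile(r"(\d{6})(\d{2})\.\w+$")


def latest_per_month(names: Iterable[str]) -> List[str]:
    """Return, for each year-month, the name carrying the latest day."""
    latest: Dict[str, Tuple[str, str]] = {}  # YYYYMM -> (DD, file name)
    for name in names:
        match = DATE_RE.search(name)
        if match is None:
            continue  # skip names that do not follow the assumed layout
        ym, day = match.groups()
        # Zero-padded day strings compare correctly as plain text.
        if ym not in latest or day > latest[ym][0]:
            latest[ym] = (day, name)
    # Return the selected names ordered by year-month.
    return [latest[ym][1] for ym in sorted(latest)]
```

The returned names can then be joined with the directory path and passed as a list to `spark.read.format(form).option("header", "true").load(...)`, as in the answer.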
