PySpark: saving a file as Parquet and reading it back

Question

My PySpark script saves the DataFrame it creates to a directory:

df.write.save(full_path, format=file_format, mode=options['mode'])

If I read the files back within the same run, everything works:

return sqlContext.read.format(file_format).load(full_path)

However, when I try to read the files from that directory in a separate run of the script, I get an error:

java.io.FileNotFoundException: File does not exist: /hadoop/log_files/some_data.json/part-00000-26c649cb-0c0f-421f-b04a-9d6a81bb6767.json

I know I can work around the problem by following the hint Spark prints:

It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

However, I would like to know why it fails, and what the canonical way to handle this kind of problem is.

Answer 1

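You are trying to manage two objects that refer to the same files, so any caching tied to those objects will cause problems, because both point at the same path. A simple solution is described here: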

https://stackoverflow.com/a/60328199/5647992
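
As a rough illustration of the hint quoted in the question, the sketch below writes the DataFrame, invalidates whatever Spark has cached for that path, and then builds a new DataFrame instead of reusing the old one. It assumes the failure comes from a stale file listing after the directory was rewritten, which the question does not confirm. The helper names save_and_reload and load_fresh are hypothetical, and spark is assumed to be an existing SparkSession; spark.catalog.refreshByPath is the path-based counterpart of REFRESH TABLE.

```python
def save_and_reload(spark, df, full_path, file_format, mode):
    """Write df to full_path, then return a DataFrame rebuilt from disk.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession.
    """
    # Same write call as in the question.
    df.write.save(full_path, format=file_format, mode=mode)
    # Drop cached data and metadata for anything that references this path.
    spark.catalog.refreshByPath(full_path)
    # Recreate the DataFrame so it lists the part files that exist now.
    return spark.read.format(file_format).load(full_path)


def load_fresh(spark, full_path, file_format):
    """Read full_path after invalidating any cached listing for it."""
    spark.catalog.refreshByPath(full_path)
    return spark.read.format(file_format).load(full_path)
```

A call might look like save_and_reload(spark, df, full_path, file_format, options['mode']). If the directory can be overwritten while another job is still reading it, refreshing will not help that reader; writing to a new directory for each run is one way to avoid the overlap.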
