How can I read a CSV from a URL into a DataFrame in PySpark without writing it to disk?
I tried the following, without luck:
import urllib.request
from io import StringIO
url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
response = urllib.request.urlopen(url)
data = response.read()
text = data.decode('utf-8')
f = StringIO(text)
df1 = sqlContext.read.csv(f, header = True, schema=customSchema)
df1.show()
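TL;DR: It is not possible, and in general transferring data through the driver is a dead end.
The csv reader can only read paths given as a URI (and http is not supported). It also accepts an RDD of strings: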
spark.read.csv(sc.parallelize(text.splitlines()))
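but the data will be written to disk. You can also use createDataFrame with pandas:
import pandas as pd
spark.createDataFrame(pd.read_csv(url))
but this once again writes to disk. If the file is small, I would just use SparkFiles: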
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("iris.csv"), header=True)
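For completeness, the RDD variant mentioned above could be wrapped up roughly as follows. This is only a sketch: `fetch_csv_lines`, `read_csv_from_url`, and `main` are hypothetical helper names, and it assumes a PySpark version whose `spark.read.csv` accepts an RDD of strings and a file small enough to fit in driver memory. As noted above, Spark may still spill the data to disk internally.

```python
import urllib.request


def fetch_csv_lines(url, encoding="utf-8"):
    """Download a small CSV into driver memory and return its lines."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode(encoding)
    # splitlines() assumes no quoted field contains an embedded newline.
    return text.splitlines()


def read_csv_from_url(spark, url, **csv_options):
    """Hypothetical helper: build a DataFrame from an RDD of CSV lines."""
    lines = fetch_csv_lines(url)
    # Everything passes through the driver, so this suits small files only.
    rdd = spark.sparkContext.parallelize(lines)
    return spark.read.csv(rdd, **csv_options)


def main():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("csv-from-url")
        .getOrCreate()
    )
    # URL taken from the question; it may have moved since.
    url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
    df = read_csv_from_url(spark, url, header=True, inferSchema=True)
    df.show(5)
    spark.stop()
```

Calling `main()` on a machine with PySpark installed should print the first rows of the file. For anything larger than a small file, copying the data to storage that the executors can read directly is probably the better route.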