Trying to read ORC as a Dask DataFrame

Question (votes: 0, answers: 1)

I have an ORC file in S3 that I want to read into a Dask DataFrame. I am using conda to get a Python 3.7 virtual environment, and I have installed Dask. My environment looks like this:

# Name                    Version                   Build  Channel
appnope                   0.1.0                    py37_0  
backcall                  0.1.0                    py37_0  
blas                      1.0                         mkl  
bokeh                     2.0.2                    py37_0  
ca-certificates           2020.1.1                      0  
certifi                   2020.4.5.1               py37_0  
click                     7.1.2                      py_0  
cloudpickle               1.4.1                      py_0  
cytoolz                   0.10.1           py37h1de35cc_0  
dask                      2.17.0                     py_0  
dask-core                 2.17.0                     py_0  
decorator                 4.4.2                      py_0  
distributed               2.17.0                   py37_0  
entrypoints               0.3                      py37_0  
freetype                  2.9.1                hb4e5f40_0  
fsspec                    0.7.1                      py_0  
heapdict                  1.0.1                      py_0  
intel-openmp              2019.4                      233  
ipykernel                 5.1.4            py37h39e3cac_0  
ipython                   7.13.0           py37h5ca1d4c_0  
ipython_genutils          0.2.0                    py37_0  
jedi                      0.17.0                   py37_0  
jinja2                    2.11.2                     py_0  
jpeg                      9b                   he5867d9_2  
jupyter_client            6.1.3                      py_0  
jupyter_core              4.6.3                    py37_0  
libcxx                    10.0.0                        1  
libedit                   3.1.20181209         hb402a30_0  
libffi                    3.3                  h0a44026_1  
libgfortran               3.0.1                h93005f0_2  
libpng                    1.6.37               ha441bb4_0  
libsodium                 1.0.16               h3efe00b_0  
libtiff                   4.1.0                hcb84e12_0  
locket                    0.2.0                    py37_1  
markupsafe                1.1.1            py37h1de35cc_0  
mkl                       2019.4                      233  
mkl-service               2.3.0            py37hfbe908c_0  
mkl_fft                   1.0.15           py37h5e564d8_0  
mkl_random                1.1.1            py37h959d312_0  
msgpack-python            1.0.0            py37h04f5b5a_1  
ncurses                   6.2                  h0a44026_1  
numpy                     1.18.1           py37h7241aed_0  
numpy-base                1.18.1           py37h3304bdc_1  
olefile                   0.46                       py_0  
openssl                   1.1.1g               h1de35cc_0  
packaging                 20.3                       py_0  
pandas                    1.0.3            py37h6c726b0_0  
parso                     0.7.0                      py_0  
partd                     1.1.0                      py_0  
pexpect                   4.8.0                    py37_0  
pickleshare               0.7.5                    py37_0  
pillow                    7.1.2            py37h4655f20_0  
pip                       20.0.2                   py37_3  
prompt-toolkit            3.0.4                      py_0  
prompt_toolkit            3.0.4                         0  
psutil                    5.7.0            py37h1de35cc_0  
ptyprocess                0.6.0                    py37_0  
pyarrow                   0.17.1                   pypi_0    pypi
pygments                  2.6.1                      py_0  
pyparsing                 2.4.7                      py_0  
python                    3.7.7                hf48f09d_4  
python-dateutil           2.8.1                      py_0  
pytz                      2020.1                     py_0  
pyyaml                    5.3.1            py37h1de35cc_0  
pyzmq                     18.1.1           py37h0a44026_0  
readline                  8.0                  h1de35cc_0  
setuptools                46.4.0                   py37_0  
six                       1.14.0                   py37_0  
sortedcontainers          2.1.0                    py37_0  
sqlite                    3.31.1               h5c1f38d_1  
tblib                     1.6.0                      py_0  
tk                        8.6.8                ha441bb4_0  
toolz                     0.10.0                     py_0  
tornado                   6.0.4            py37h1de35cc_1  
traitlets                 4.3.3                    py37_0  
typing_extensions         3.7.4.1                  py37_0  
wcwidth                   0.1.9                      py_0  
wheel                     0.34.2                   py37_0  
xz                        5.2.5                h1de35cc_0  
yaml                      0.1.7                hc338f04_2  
zeromq                    4.3.1                h0a44026_3  
zict                      2.0.0                      py_0  
zlib                      1.2.11               h1de35cc_3  
zstd                      1.3.7                h5bba6e5_0 

I tried this:

import dask.dataframe as dd
orders_path = "s3://bucketname/folder/ord_files_dir/"
orders = dd.read_orc(orders_path)

But I got this error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/utils.py in import_required(mod_name, error_msg)
     96     try:
---> 97         return import_module(mod_name)
     98     except ImportError:

/anaconda3/envs/dask_env/lib/python3.7/importlib/__init__.py in import_module(name, package)
    126             level += 1
--> 127     return _bootstrap._gcd_import(name[level:], package, level)
    128 

/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _gcd_import(name, package, level)

/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _find_and_load(name, import_)

/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)

/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _load_unlocked(spec)

/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap_external.py in exec_module(self, module)

/anaconda3/envs/dask_env/lib/python3.7/importlib/_bootstrap.py in _call_with_frames_removed(f, *args, **kwds)

/anaconda3/envs/dask_env/lib/python3.7/site-packages/pyarrow/orc.py in <module>
     23 from pyarrow.lib import Schema
---> 24 import pyarrow._orc as _orc
     25 

ModuleNotFoundError: No module named 'pyarrow._orc'

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-3-67de491f90db> in <module>
----> 1 orders = dd.read_orc(orders_path)

/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/dataframe/io/orc.py in read_orc(path, columns, storage_options)
     46     ...                  'master/examples/demo-11-zlib.orc')  # doctest: +SKIP
     47     """
---> 48     orc = import_required("pyarrow.orc", "Please install pyarrow >= 0.9.0")
     49     import pyarrow as pa
     50 

/anaconda3/envs/dask_env/lib/python3.7/site-packages/dask/utils.py in import_required(mod_name, error_msg)
     97         return import_module(mod_name)
     98     except ImportError:
---> 99         raise RuntimeError(error_msg)
    100 
    101 

RuntimeError: Please install pyarrow >= 0.9.0

As far as I can tell, I am using supported versions of everything involved: python=3.7 and pyarrow >= 0.9.0.

Any suggestions on what to do next would be great!

python dask orc
1 answer
0
votes

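Pasting the conversation from the daskdev gitter channel (thanks @uwe-l-korn).

For pyarrow installed with pip:

The ORC build is disabled in the wheels because of linking problems.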

https://github.com/apache/arrow/blob/f79a38169bd2e29b0dc2f27cf0006b9fec613774/python/manylinux201x/build_arrow.sh#L46-L48

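This could probably be fixed by updating the ORC and protobuf versions in these scripts, but it needs a volunteer to look into it.

So the simplest solution to this problem is to install pyarrow with conda.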
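
The error message "Please install pyarrow >= 0.9.0" is misleading here: pyarrow is installed, but its pyarrow.orc submodule fails to import. A minimal sketch for checking this before calling dd.read_orc follows. The helper name module_available is made up for illustration, and the shell commands in the comment are an assumed way to swap the pip wheel for a conda build (the conda-forge channel is an assumption, not something the answer specifies).

```python
import importlib


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment.

    Hypothetical helper for illustration; not part of dask or pyarrow.
    """
    try:
        importlib.import_module(name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False
    return True


if __name__ == "__main__":
    if module_available("pyarrow.orc"):
        print("pyarrow.orc imports fine; dd.read_orc should get past the import step")
    else:
        # Assumed remedy, run in a shell inside the conda environment:
        #   pip uninstall pyarrow
        #   conda install -c conda-forge pyarrow
        print("pyarrow.orc is not importable; this pyarrow build lacks ORC support")
```

Running this in the environment from the question would be expected to take the second branch, since the pip-installed wheel does not ship pyarrow._orc.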
