What is the recommended way to add data (a pandas dataframe) to an existing dask dataframe in parquet storage?

For example, this test fails intermittently:
```python
import dask.dataframe as dd
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def test_dask_intermittent_error(tmp_path):
    df = pd.DataFrame(
        np.random.randn(100, 1),
        columns=['A'],
        index=pd.date_range('20130101', periods=100, freq='T'),
    )
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    _ = (dd1
         .append(dd.read_parquet(tmp_path))
         .to_parquet(tmp_path))
    assert_frame_equal(df, dd.read_parquet(tmp_path).compute())
```
giving

```
.venv/lib/python3.7/site-packages/dask/dataframe/core.py:3812: in to_parquet
    return to_parquet(self, path, *args, **kwargs)
...
fastparquet.util.ParquetException: Metadata parse failed: /private/var/folders/_1/m2pd_c9d3ggckp1c1p0z3v8r0000gn/T/pytest-of-jfaleiro/pytest-138/test_dask_intermittent_error0/part.0.parquet
```
(We considered relying on a plain append and sorting out the order after retrieval, but that seems to hit another bug, namely:)
```python
def test_dask_prepend_as_append(tmp_path):
    df = pd.DataFrame(
        np.random.randn(100, 1),
        columns=['A'],
        index=pd.date_range('20130101', periods=100, freq='T'),
    )
    dfs = np.array_split(df, 2)
    dd1 = dd.from_pandas(dfs[0], npartitions=1)
    dd2 = dd.from_pandas(dfs[1], npartitions=1)
    dd2.to_parquet(tmp_path)
    dd1.to_parquet(tmp_path, append=True)
    assert_frame_equal(df, dd.read_parquet(tmp_path).compute())
```
giving

```
ValueError: Appended divisions overlapping with previous ones.
```
If you avoid writing the `_metadata` file (which would be the case with default settings and pyarrow), then you could simply rename your files, to ensure that the prepended partition sorts before the rest when listed by glob. Normally, dask names files with serial numbers starting at 0.