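I'm seeing behavior that looks like a bug in Dask, but I want to make sure I'm not doing something wrong.
I have a Dask dataframe called labeled_texts. It contains a column called "text". I computed a Dask series called label_rows that contains boolean values and has the same length as labeled_texts. I use it to index into labeled_texts and then take the "text" column from the resulting smaller dataframe, like so: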
labeled_text[label_rows]["text"].compute()
When I run the line above, I get KeyError: 'text' from deep inside the Dask/Pandas code. However, the following commands work:
labeled_text[label_rows].compute()["text"]
labeled_text[label_rows.compute()]["text"]
I believe all three commands should produce the same result, and the first one should not raise an error. Is that correct?
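Unfortunately, I haven't been able to come up with a minimal reproduction that I can post here. The problem occurs consistently on one particular cluster, but running the same code and data on another machine works fine. (This makes me even more inclined to think it is a Dask bug.)
Without a better reproduction, I don't expect anyone to be able to solve this for me. I just want to make sure I'm not doing something wrong.
Here is the full stack trace: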
Traceback (most recent call last):
  ...my code that ultimately calls compute()...
  File "/usr/local/lib/python3.6/site-packages/dask/base.py", line 175, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/dask/base.py", line 446, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/distributed/client.py", line 2510, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/usr/local/lib/python3.6/site-packages/distributed/client.py", line 1812, in gather
    asynchronous=asynchronous,
  File "/usr/local/lib/python3.6/site-packages/distributed/client.py", line 753, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/usr/local/lib/python3.6/site-packages/distributed/utils.py", line 337, in sync
    six.reraise(*error[0])
  File "/usr/local/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/distributed/utils.py", line 322, in f
    result[0] = yield future
  File "/usr/local/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/usr/local/lib/python3.6/site-packages/distributed/client.py", line 1668, in _gather
    six.reraise(type(exception), exception, traceback)
  File "/usr/local/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/site-packages/dask/optimization.py", line 1059, in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
  File "/usr/local/lib/python3.6/site-packages/dask/core.py", line 149, in get
    result = _execute_task(task, cache)
  File "/usr/local/lib/python3.6/site-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/frame.py", line 2980, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'text'
Nothing stands out to me. As you suggest yourself, I'd recommend trying to put together a minimal reproducer.