Koalas groupby -> apply returns 'cannot insert "key", already exists'

Question (votes: 0, answers: 1)

I have been struggling with this problem and have not been able to solve it. I have the following DataFrame:

import databricks.koalas as ks
from pandas import Timestamp

x = ks.DataFrame.from_records(
{'ds': {0: Timestamp('2018-10-06 00:00:00'),
  1: Timestamp('2017-06-08 00:00:00'),
  2: Timestamp('2018-10-22 00:00:00'),
  3: Timestamp('2017-02-08 00:00:00'),
  4: Timestamp('2019-02-03 00:00:00'),
  5: Timestamp('2019-02-26 00:00:00'),
  6: Timestamp('2017-04-15 00:00:00'),
  7: Timestamp('2017-07-02 00:00:00'),
  8: Timestamp('2017-04-04 00:00:00'),
  9: Timestamp('2017-03-20 00:00:00'),
  10: Timestamp('2018-06-09 00:00:00'),
  11: Timestamp('2017-01-15 00:00:00'),
  12: Timestamp('2018-05-07 00:00:00'),
  13: Timestamp('2018-01-17 00:00:00'),
  14: Timestamp('2017-07-11 00:00:00'),
  15: Timestamp('2018-12-17 00:00:00'),
  16: Timestamp('2018-12-05 00:00:00'),
  17: Timestamp('2017-05-22 00:00:00'),
  18: Timestamp('2017-08-13 00:00:00'),
  19: Timestamp('2018-05-21 00:00:00')},
 'store': {0: 81,
  1: 128,
  2: 81,
  3: 128,
  4: 25,
  5: 128,
  6: 11,
  7: 124,
  8: 43,
  9: 25,
  10: 25,
  11: 124,
  12: 124,
  13: 128,
  14: 81,
  15: 11,
  16: 124,
  17: 11,
  18: 167,
  19: 128},
 'stock': {0: 1,
  1: 236,
  2: 3,
  3: 9,
  4: 36,
  5: 78,
  6: 146,
  7: 20,
  8: 12,
  9: 12,
  10: 15,
  11: 25,
  12: 10,
  13: 7,
  14: 0,
  15: 230,
  16: 80,
  17: 6,
  18: 110,
  19: 8},
 'sells': {0: 1.0,
  1: 17.0,
  2: 1.0,
  3: 2.0,
  4: 1.0,
  5: 2.0,
  6: 7.0,
  7: 1.0,
  8: 1.0,
  9: 1.0,
  10: 2.0,
  11: 1.0,
  12: 1.0,
  13: 1.0,
  14: 1.0,
  15: 1.0,
  16: 1.0,
  17: 3.0,
  18: 2.0,
  19: 1.0}}
)

And this is the function I want to use in a groupby-apply:

import numpy as np

def compute_indicator(df):
  return (
    df.copy()
    .assign(
      indicator=lambda x: x['a'] < np.percentile(x['b'], 80)
    )
    .astype(int)
    .fillna(1)
  )

where df is a pandas DataFrame. If I do the groupby-apply with pandas, the code runs as expected:

import pandas as pd
# This runs
a = pd.DataFrame.from_dict(x.to_dict()).groupby('store').apply(compute_indicator)

But when I try to run the same code on Koalas, I get the following error: ValueError: cannot insert store, already exists

x.groupby('store').apply(compute_indicator)
# ValueError: cannot insert store, already exists

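I cannot use type annotations in compute_indicator, because some of the columns are not fixed (they travel along with the DataFrame, meaning they will be used by other transformations).

What should I do to get this code running in Koalas?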

python pandas databricks spark-koalas
1 Answer (0 votes)

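As of Koalas 0.29.0, when koalas.DataFrame.groupby(keys).apply(f) is run on an untyped function f, it has to infer the schema first, and to do so it runs pandas.DataFrame.head(n).groupby(keys).apply(f). The problem is that pandas' apply receives as its argument a DataFrame that has the groupby keys both as the index and as a column (see this issue).

The result of pandas.DataFrame.head(n).groupby(keys).apply(f) is then converted to a koalas.DataFrame, so if f does not drop the keys columns, this conversion raises an exception because of the duplicated column names (see the related issue).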
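The name clash described above can be reproduced in plain pandas, without Spark. The snippet below is only a minimal sketch: the toy values are made up, and set_index(..., drop=False) is used as a stand-in for the intermediate state in which 'store' is both the index and a column. It does not call Koalas itself, so whether dropping the key column inside f is enough on Koalas 0.29.0 is an assumption rather than something verified here.

```python
import pandas as pd

# Toy frame; column names follow the question, the values are made up.
pdf = pd.DataFrame({"store": [81, 81, 128], "stock": [1, 3, 236]})

# Stand-in for the state described above:
# 'store' is the index name and also a regular column.
keyed = pdf.set_index("store", drop=False)

# Moving the index back into the columns collides with the existing column.
try:
    keyed.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# If the key column is dropped first, only one 'store' remains
# and the index can be turned back into a column.
fixed = keyed.drop(columns=["store"]).reset_index()
print(list(fixed.columns))
```

Following the same idea, one thing to try on the Koalas side would be to have the function passed to apply return its result with the key column removed, for example by ending compute_indicator with .drop(columns=['store']).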
