Looping through a subset of rows in a DataFrame

Question

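I am trying to loop over the rows of a DataFrame with a function that returns the most frequent element in a series. The function works fine when I pass it a series by hand: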

import pandas as pd

# Create DataFrame
df = pd.DataFrame({'a' : [1, 2, 1, 2, 1, 2, 1, 1],
              'b' : [1, 1, 2, 1, 1, 1, 2, 2],
              'c' : [1, 2, 2, 1, 2, 2, 2, 1]})

# Create function calculating most frequent element
from collections import Counter

def freq_value(series):
    return Counter(series).most_common()[0][0]

# Test function on one row
freq_value(df.iloc[1])

# Another test
freq_value((df.iloc[1, 0], df.iloc[1, 1], df.iloc[1, 2]))

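Both tests give the expected result. However, when I try to apply the function to the DataFrame rows in a loop and save the result to a new column, I get the error "'Series' object is not callable", 'occurred at index 0'. The line that raises the error is: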

# Loop through rows of a dataframe and write the result into new column
df['result'] = df.apply(lambda row: freq_value((row('a'), row('b'), row('c'))), axis = 1)

How exactly does row() work inside apply()? Shouldn't it pass the values from columns 'a', 'b' and 'c' to my freq_value() function?

Answer 1

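row is not a function inside your lambda, so parentheses are not appropriate. Instead, use the __getitem__ method or the loc accessor to access values. The syntactic sugar for the former is []: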

df['result'] = df.apply(lambda row: freq_value((row['a'], row['b'], row['c'])), axis=1)

An alternative using loc:

def freq_value_calc(row):
    return freq_value((row.loc['a'], row.loc['b'], row.loc['c']))

To see why this happens, it helps to rewrite the lambda as a named function:

def freq_value_calc(row):
    print(type(row))  # useful for debugging
    return freq_value((row['a'], row['b'], row['c']))

df['result'] = df.apply(freq_value_calc, axis=1)

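Run this and you will find that row has type <class 'pandas.core.series.Series'>, i.e. with axis=1 each row is a series indexed by the column labels. To access the value for a given label in a series, you can use the __getitem__ / [] syntax or loc.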
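As a rough sketch of the point above, the snippet below uses a cut-down version of the question's DataFrame and a made-up helper, record_row, to record what apply() passes in. It assumes nothing beyond pandas itself. Each row arrives as a Series whose index holds the column labels, so [] and loc both work, while calling the row like a function raises the TypeError from the question.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [1, 1], 'c': [1, 2]})

seen = []  # records what apply() hands to the function

def record_row(row):
    # hypothetical helper, for illustration only
    seen.append((type(row).__name__, list(row.index)))
    return row['a'] + row.loc['b']  # [] and loc read values by the same labels

sums = df.apply(record_row, axis=1)

# Calling a row like a function reproduces the error from the question
try:
    df.iloc[0]('a')
    call_error = None
except TypeError as exc:
    call_error = str(exc)

print(seen[0])
print(call_error)
```

Some older pandas versions may call the function more than once on the first row, so only the first recorded entry is worth inspecting.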

Answer 2

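@jpp's answer covers how to apply a custom function, but you can also get the desired result with df.mode and axis=1. This avoids apply and still gives you a column holding the most common value of each row.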

df['result'] = df.mode(axis=1)[0]  # column 0 holds the first mode; tied rows would add further columns

>>> df
   a  b  c  result
0  1  1  1       1
1  2  1  2       2
2  1  2  2       2
3  2  1  1       1
4  1  1  2       1
5  2  1  2       2
6  1  2  2       2
7  1  2  1       1
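One caveat, sketched with hypothetical data rather than the question's: the example DataFrame cannot produce ties (three columns, two distinct values), but in general DataFrame.mode(axis=1) returns a DataFrame with one column per tied mode, padded with NaN where a row has fewer modes. Picking column 0 explicitly keeps the assignment to a single column.

```python
import pandas as pd

# Hypothetical data: with only two columns, a row holding two
# different values is a tie between them
tied = pd.DataFrame({'a': [1, 2], 'b': [1, 1]})

modes = tied.mode(axis=1)
# Row 0 has a single mode, row 1 has two, so `modes` has two
# columns and row 0 is padded with NaN in the second one.
print(modes)

# Select the first mode column explicitly before assigning
tied['result'] = modes[0]
print(tied)
```

Because of the NaN padding, the mode columns come back as floats in this tied case, so a cast may be needed if integer output matters.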

Answer 3

df['CommonValue'] = df.apply(lambda x: x.mode()[0], axis = 1)
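Note that x.mode()[0] and the question's Counter-based freq_value can disagree when a row has a tie. The sketch below uses a made-up tied row to show this: Counter.most_common() keeps tied elements in the order first encountered (Python 3.7+), while Series.mode() returns its modes sorted.

```python
from collections import Counter

import pandas as pd

def freq_value(series):
    # same helper as in the question
    return Counter(series).most_common()[0][0]

# Hypothetical row where both values occur once, i.e. a tie
row = pd.Series({'a': 2, 'b': 1})

counter_pick = freq_value(row)  # first value encountered among the tied ones
mode_pick = row.mode()[0]       # smallest value, since modes come back sorted

print(counter_pick, mode_pick)
```

Which tie-breaking rule is appropriate depends on the data; for the DataFrame in the question no ties occur, so both approaches give the same column.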
