How do I correctly impute these NaN values using the mode of another column?

Question

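I'm learning how to deal with missing values in a dataset. I have a table with about 1 million rows, and I'm trying to handle a small number of missing values.

My data comes from a bike-sharing system, and the missing values are in the start and end station columns.

Data: missing start stations, only 7 values

Data: missing end stations, 24 values in total

In both cases I'd like to fill the NaNs with the mode of the "opposite" station. For example, for start_station == 21, I want to find the most common end_station and use it to fill the missing value, e.g. df.loc[df['start_station'] == 21].end_station.mode()

I tried to do this with a function: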

def inpute_end_station(df):
    for index, row in df.iterrows():    
        if pd.isnull(df.loc[index, 'end_station']):

            start_st = df.loc[index, 'start_station']
            mode = df.loc[df['start_station'] == start_st].end_station.mode()
            df.loc[index, 'end_station'].fillna(mode, inplace=True)

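The last line throws AttributeError: 'numpy.float64' object has no attribute 'fillna'. If I instead just use df.loc[index, 'end_station'] = mode, I get ValueError: Incompatible indexer with Series.

Am I close? I understand that modifying something you are iterating over in pandas is bad practice, so what is the correct way to change the start_station and end_station columns and replace the NaNs with the corresponding mode of the complementary station?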

Answer 1

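In my opinion, when you want to iterate over a column in pandas like this, the best practice is to use the apply() function.

For this particular case, I would suggest the following approach, shown below on sample data. I don't have much experience with the mode() method, so I use value_counts() combined with first_valid_index() to determine the mode value.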

# import pandas
import pandas as pd

# make a sample data
list_of_rows = [
  {'start_station': 1, 'end_station': 1},
  {'start_station': None, 'end_station': 1},
  {'start_station': 1, 'end_station': 2},
  {'start_station': 1, 'end_station': 3},
  {'start_station': 2, 'end_station': None},
  {'start_station': 2, 'end_station': 3},
  {'start_station': 2, 'end_station': 3},
]

# make a pandas data frame
df = pd.DataFrame(list_of_rows)

# define a function
def fill_NaNs_in_end_station(row):
    if pd.isnull(row['end_station']):
        start_station = row['start_station']
        return df[df['start_station']==start_station].end_station.value_counts().first_valid_index()
    return row['end_station']

# apply the function to each row (axis=1); no lambda wrapper is needed
df['end_station'] = df.apply(fill_NaNs_in_end_station, axis=1)
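
As a possible alternative to the row-wise apply() above, here is a minimal sketch of a vectorized version that uses mode() directly. It also touches on the two errors from the question: df.loc[index, 'end_station'] is a single scalar, so it has no fillna(), and mode() returns a Series (there can be ties), so one element has to be picked before assigning. The helper names first_mode and fill_with_group_mode are hypothetical, made up for this illustration, and the sample data is the same as above.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame([
    {'start_station': 1, 'end_station': 1},
    {'start_station': None, 'end_station': 1},
    {'start_station': 1, 'end_station': 2},
    {'start_station': 1, 'end_station': 3},
    {'start_station': 2, 'end_station': None},
    {'start_station': 2, 'end_station': 3},
    {'start_station': 2, 'end_station': 3},
])

def first_mode(values):
    # Series.mode() returns a Series (possibly several tied values),
    # so pick the first one; return NaN if the group has no valid values.
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else float('nan')

def fill_with_group_mode(frame, target, by):
    # Hypothetical helper: for each row, compute the mode of `target`
    # among rows sharing the same `by` value, then use it to fill NaNs.
    group_mode = frame.groupby(by)[target].transform(first_mode)
    return frame[target].fillna(group_mode)

# Compute both fills from the original data before assigning,
# so imputed values in one column do not influence the other.
filled_end = fill_with_group_mode(df, 'end_station', 'start_station')
filled_start = fill_with_group_mode(df, 'start_station', 'end_station')
df['end_station'] = filled_end
df['start_station'] = filled_start
```

Rows where both stations are missing stay NaN, because there is no group to take a mode from. When several stations are tied, mode() returns them sorted, so this sketch picks the smallest one, which may differ from what the value_counts() approach picks.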
