How to select a percentage of rows in a pandas DataFrame

Question

In Python, I have some DataFrames structured like this:

0 0 0 0
1 1 1 1
2 2 2 2
. . . .
n n n n

How can I select the middle 33% of the rows (determined by index position, not by value)?

This is what I tried:

df.iloc[int(len(df)*0.33):int(len(df)*0.66)]

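It does work, but it feels really messy, not to mention the forced conversion to integers.

I would like to know whether there is a cleaner way to select a percentage of a DataFrame, because so far I have not found any useful command in the documentation.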

Answer 1

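You can also use NumPy's percentile function on the index. This approach also works when your index does not start at zero, although it assumes a numeric index.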

import numpy as np
df[(df.index > np.percentile(df.index, 33)) & (df.index <= np.percentile(df.index, 66))]
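
As a rough illustration of how this behaves on an index that does not start at zero, here is a minimal sketch. The sample frame, the `value` column, and the index range 100 to 199 are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100 rows whose index runs from 100 to 199.
df = pd.DataFrame({"value": range(100)}, index=range(100, 200))

# Cut points are computed on the index labels, not on the column values.
lower = np.percentile(df.index, 33)
upper = np.percentile(df.index, 66)

# The boolean mask keeps rows whose index label lies in (lower, upper].
middle = df[(df.index > lower) & (df.index <= upper)]
print(len(middle), middle.index.min(), middle.index.max())
```

Because the mask compares index labels, the result follows the label values rather than row positions, so an unevenly spaced index would not give an even split by row count.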

Answer 2

Write a function to do the job, i.e.

def get_middle(df, percent):
    # percent is the fraction of rows dropped from EACH end of the frame
    start = int(len(df) * percent)
    end = len(df) - start

    return df.iloc[start:end]

get_middle(df, 0.33)

Answer 3

Split the data 70:30 and try this

percentage = round(len(df) / 100 * 70)  # number of rows in the 70% part
train = df.head(percentage)
test = df.iloc[percentage:len(df), :]

Answer 4

To do this, you need to "play" with the numbers and define the positions you want:

df.iloc[len(df) // 3 : len(df) - len(df) // 3, :]

or

df.iloc[len(df) // 3 : len(df) // 3 * 2, :]

In these examples, I defined an interval such as

(len(df.index) // 3) : (len(df.index) // 3 * 2)

which selects the DataFrame rows between 1/3 and 2/3 of the way through the table.
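
Note that the two slices shown in this answer agree only when the row count is divisible by 3. A short sketch with a made-up 10-row frame (the `value` column is hypothetical) shows where they differ:

```python
import pandas as pd

# Hypothetical 10-row frame, chosen because 10 is not divisible by 3.
df = pd.DataFrame({"value": range(10)})
n = len(df)

# Drop n // 3 rows from each end: keeps positions 3 through 6.
symmetric = df.iloc[n // 3 : n - n // 3, :]

# Stop at 2 * (n // 3): keeps positions 3 through 5.
two_thirds = df.iloc[n // 3 : n // 3 * 2, :]

print(len(symmetric), len(two_thirds))
```

The first form stays centered on the frame, while the second always returns exactly `len(df) // 3` rows.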

Answer 5

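Using .iloc with a percentage calculation to select the middle 33% of the rows in a DataFrame is correct and fairly common. However, if you are looking for a cleaner way to compute the slice bounds without converting types by hand, you can simplify the code by using integer division.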

start_idx = len(df) * 33 // 100
end_idx = len(df) * 66 // 100
df_selected = df.iloc[start_idx:end_idx]

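Because integer division on integers already yields an integer, you avoid the explicit int() conversion. This makes the code tidier and more readable. If you want more flexibility, you could also define a function that generalizes this to select any percentage range from a DataFrame.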
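
One possible shape for such a function is sketched below. The name `select_percent_range` and its parameters are hypothetical, and the sketch assumes the percentages are given as integers between 0 and 100.

```python
import pandas as pd

def select_percent_range(df, start_pct, end_pct):
    """Return the rows between two integer percentages of the frame, by position."""
    n = len(df)
    # Integer division keeps both bounds as ints, so no int() call is needed.
    start_idx = n * start_pct // 100
    end_idx = n * end_pct // 100
    return df.iloc[start_idx:end_idx]

# Hypothetical 200-row frame used only to exercise the function.
df = pd.DataFrame({"value": range(200)})
middle = select_percent_range(df, 33, 66)
print(len(middle))
```

Passing non-integer percentages would make the bounds floats and .iloc would reject them, so a caller wanting fractional percentages would need to round explicitly.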