Python - Pandas - In DataFrame rows, find the column index of the first occurrence of a value across multiple columns

Question

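I am a Python beginner. I have tried various approaches to this problem and all of them failed. I think in terms of how SAS works and am not used to Python.

What I want to do is find the column index of the first new vehicle, assuming the vehicles in the table are ordered from newest to oldest, and then use that index to select the other columns that belong to it: the first new nameplate, the first new purchase date, and so on.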

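Here is what I am trying to do, in SAS pseudocode.

SAS has user-defined temporary arrays. I can create arrays like this:

array newused newused_status1 - newused_status5
array nameplate car1 - car5

and then use the arrays in a loop:

loop over newused(i)
    find the index of the first newly purchased vehicle in array newused and assign it to a variable
end loop

first_new_nameplate = nameplate(index of first new vehicle)

I cannot find a closely similar approach in Python.

What I am trying to do is find the first occurrence of NEW in newused_status1 - newused_status5, do some text processing to substring out the number associated with the first new vehicle, and build a new value, the nameplate# of the first new vehicle: nameplate1 or nameplate2, whichever corresponds to the first new vehicle.

What I cannot manage is using that column-name value to make the assignment: first_new_nameplate = mydataframe[nameplate# of first newused]

Here is my beginner Python code: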

import pandas as pd
import numpy as np


# create a DataFrame
df = pd.DataFrame([[12345,"Name1","Name1","Name2","","","A","U","N","N","",""], 
               [45678,"Name3","Name1","Name2","","","S","N","N","N","",""],
               [45679,"Name2","Name2","Name2","","","S","U","U","U","",""],
               [98765,"Name3","Name2","Name2","","","","CPO","U","N","",""]], 
           columns=("ID","Car1","Car2","Car3","Car4","Car5","Income",
                    "NU1","NU2","NU3","NU4","NU5") )

nu_array = df[["NU1","NU2","NU3","NU4","NU5"]] 
nameplate_array = df[["Car1","Car2","Car3","Car4","Car5"]] 



# Find the name of the first new-vehicle column and store it in a new column
df['first_nu_col_name'] = df.eq("N").T.idxmax()
# ONE PROBLEM HERE: this returns a column name even for the third row (index 2), where all vehicles are "U" (used)

df['col_num'] = df['first_nu_col_name'].str[2:]
df['first_nu_veh_col_name'] = "Car" + df['col_num']

# My code returns the correct column name, except where there is no new vehicle; then it returns an incorrect name.
# I tried iterating (a loop like the SAS pseudocode above) and failed. From what I have read, looping is not "pythonic" and vectorizing is much better.
# However, nothing I found by googling has worked, and I found nothing I could understand about a vectorized solution.

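What I need to do:

Return a correct result when a row contains no new vehicle.
Create new variables by applying the position of the first new vehicle, as an index or a column name, to the other descriptors that are laid out the same way (i.e. purchase_date1-5, price1-5, and so on).

I tried applying the column index or the column name of interest to each row of the DataFrame to look up the values I want in the other columns, and got error messages. Using a function such as index(), or trying to use first_nu_veh_col_name as a column name, did not work.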

Answer 1

Use idxmax(), then iterate over the resulting Series to check whether the value at that position is actually "N".

max_nu = (df[['NU1', 'NU2', 'NU3', 'NU4', 'NU5']] == 'N').idxmax(axis = 'columns')
df['first_nu_veh_col_name'] = ''

for index, col in max_nu.items():
    if df.loc[index, col] == 'N':
        df.loc[index, 'first_nu_veh_col_name'] = 'Car' + col[2:]
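
The loop above stores a column name, but the question also asks for the value held in that column. A minimal sketch of that last step, assuming the question's sample data; the column name first_new_nameplate is made up for illustration and does not appear in the answer:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [[12345, "Name1", "Name1", "Name2", "", "", "A", "U", "N", "N", "", ""],
     [45678, "Name3", "Name1", "Name2", "", "", "S", "N", "N", "N", "", ""],
     [45679, "Name2", "Name2", "Name2", "", "", "S", "U", "U", "U", "", ""],
     [98765, "Name3", "Name2", "Name2", "", "", "", "CPO", "U", "N", "", ""]],
    columns=["ID", "Car1", "Car2", "Car3", "Car4", "Car5", "Income",
             "NU1", "NU2", "NU3", "NU4", "NU5"])

# Same steps as the answer: name of the matching Car column, or '' if none
max_nu = (df[["NU1", "NU2", "NU3", "NU4", "NU5"]] == "N").idxmax(axis="columns")
df["first_nu_veh_col_name"] = ""
for index, col in max_nu.items():
    if df.loc[index, col] == "N":
        df.loc[index, "first_nu_veh_col_name"] = "Car" + col[2:]

# Illustrative extra step (not part of the answer): read the cell that the
# stored column name points to, leaving a missing value where the name is ''
df["first_new_nameplate"] = [
    df.at[idx, col] if col else None
    for idx, col in df["first_nu_veh_col_name"].items()
]
```

Looking the value up row by row with df.at is easy to read, though a positional NumPy lookup is likely to be faster on large frames.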

Answer 2

First we find the index of the first new vehicle, returning None if there is none:

df['first_new_index'] = nu_array.apply(lambda row: next((i+1 for i, val in enumerate(row) 
                                                        if val == 'N'), 
                                                       None), 
                                      axis=1)

Then we look up the nameplate of the first new vehicle, returning None if there is no index value.

df['first_new_nameplate'] = df.apply(lambda row: row[f'Car{int(row["first_new_index"])}'] 
                                                 if pd.notnull(row["first_new_index"]) 
                                                 else None, 
                                     axis=1)
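
The question also mentions other column groups laid out the same way (purchase_date1-5, price1-5). One way to reuse the lookup above is to wrap it in a small function that takes the column prefix. This is only a sketch: pick_by_index is a hypothetical helper name, and it is demonstrated on the Car columns because the sample data contains no other group.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [[12345, "Name1", "Name1", "Name2", "", "", "A", "U", "N", "N", "", ""],
     [45678, "Name3", "Name1", "Name2", "", "", "S", "N", "N", "N", "", ""],
     [45679, "Name2", "Name2", "Name2", "", "", "S", "U", "U", "U", "", ""],
     [98765, "Name3", "Name2", "Name2", "", "", "", "CPO", "U", "N", "", ""]],
    columns=["ID", "Car1", "Car2", "Car3", "Car4", "Car5", "Income",
             "NU1", "NU2", "NU3", "NU4", "NU5"])

nu_cols = [f"NU{i}" for i in range(1, 6)]

# 1-based position of the first "N" in each row, or None when there is none
df["first_new_index"] = df[nu_cols].apply(
    lambda row: next((i + 1 for i, val in enumerate(row) if val == "N"), None),
    axis=1)


def pick_by_index(frame, prefix, index_col="first_new_index"):
    """Hypothetical helper: for each row, return the value of the column
    named f"{prefix}{k}", where k is that row's index_col value."""
    return frame.apply(
        lambda row: row[f"{prefix}{int(row[index_col])}"]
        if pd.notnull(row[index_col]) else None,
        axis=1)


df["first_new_nameplate"] = pick_by_index(df, "Car")
```

With a group such as price1-5 present, the same call with a different prefix would select the matching price; that part is an assumption about the asker's real data.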

Answer 3

When a row contains no True, idxmax returns the index of the first False.

A vectorized way to avoid this is to combine idxmax with where and any, which gives NaN when a row contains no True:

m = df.eq('N')
df['first_nu_col_name'] = m.idxmax(axis=1).where(m.any(axis=1))
df['col_num'] = df['first_nu_col_name'].str[2:]
df['first_nu_veh_col_name'] = "Car" + df['col_num']

Output:

      ID   Car1   Car2   Car3 Car4 Car5 Income  NU1 NU2 NU3 NU4 NU5 first_nu_col_name col_num first_nu_veh_col_name
0  12345  Name1  Name1  Name2                A    U   N   N                       NU2       2                  Car2
1  45678  Name3  Name1  Name2                S    N   N   N                       NU1       1                  Car1
2  45679  Name2  Name2  Name2                S    U   U   U                       NaN     NaN                   NaN
3  98765  Name3  Name2  Name2                   CPO   U   N                       NU3       3                  Car3

Intermediates:

# m
      ID   Car1   Car2   Car3   Car4   Car5  Income    NU1    NU2    NU3  \
0  False  False  False  False  False  False   False  False   True   True   
1  False  False  False  False  False  False   False   True   True   True   
2  False  False  False  False  False  False   False  False  False  False   
3  False  False  False  False  False  False   False  False  False   True   

     NU4    NU5  first_nu_col_name  col_num  first_nu_veh_col_name  
0  False  False              False    False                  False  
1  False  False              False    False                  False  
2  False  False              False    False                  False  
3  False  False              False    False                  False  

# m.idxmax(axis=1)
0    NU2
1    NU1
2     ID
3    NU3
dtype: object

# m.any(axis=1)
0     True
1     True
2    False
3     True
dtype: bool

# m.idxmax(axis=1).where(m.any(axis=1))
0    NU2
1    NU1
2    NaN
3    NU3
dtype: object
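
Two possible refinements, offered as a sketch and not as part of the answer: the mask m above is built from every column, so an "N" in an unrelated column such as Income would also match, and the result is still a column name instead of the nameplate itself. Assuming the question's sample data, the variant below limits the mask to the NU columns and uses NumPy positional indexing to pull the value from the Car column at the same position. The names nu_cols, car_cols, and first_new_nameplate are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [[12345, "Name1", "Name1", "Name2", "", "", "A", "U", "N", "N", "", ""],
     [45678, "Name3", "Name1", "Name2", "", "", "S", "N", "N", "N", "", ""],
     [45679, "Name2", "Name2", "Name2", "", "", "S", "U", "U", "U", "", ""],
     [98765, "Name3", "Name2", "Name2", "", "", "", "CPO", "U", "N", "", ""]],
    columns=["ID", "Car1", "Car2", "Car3", "Car4", "Car5", "Income",
             "NU1", "NU2", "NU3", "NU4", "NU5"])

nu_cols = [f"NU{i}" for i in range(1, 6)]
car_cols = [f"Car{i}" for i in range(1, 6)]

m = df[nu_cols].eq("N")                # mask restricted to the status columns
has_new = m.any(axis=1).to_numpy()     # False for rows with no "N" at all
pos = m.to_numpy().argmax(axis=1)      # 0-based position of first True (0 if none)

# Pick, for each row, the Car value sitting at the same position
cars = df[car_cols].to_numpy()
picked = cars[np.arange(len(df)), pos]

# Blank out rows that had no new vehicle, mirroring the where/any idea above
df["first_new_nameplate"] = pd.Series(picked, index=df.index).where(has_new)
```

The same pos array could index any other group of five columns that shares the ordering, which is close to what the SAS arrays in the question do.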
