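I'm a Python beginner. I've tried various approaches to this problem, but they all failed. I still think in terms of how SAS works and am not used to Python.
What I want to do is find the column index of the first new vehicle, assuming the vehicles in the table are ordered from newest to oldest, and use it to pick values from the other columns: the first new nameplate, the first new purchase date, and so on.
Here is what I'm trying to do, in SAS pseudocode.
SAS has user-defined temporary arrays. I can create arrays like this:
array newused newused_status1 - newused_status5
array nameplate car1 - car5
Then use the arrays in a loop:
loop over newused(i): find the index of the first newly purchased vehicle in array newused and assign it to a variable
end loop
first_new_nameplate = nameplate(index of first new vehicle)
I can't find a closely similar way to do this in Python.
What I've tried is to look for the first occurrence of NEW in newused_status1 - newused_status5, do some text processing to substring out the number associated with the first new vehicle, and build a new value from it: nameplate# of the first newused (nameplate1 or nameplate2, whichever is the first new vehicle).
What I can't manage is to use a column-name value to do the assignment: first_new_nameplate = mydataframe[nameplate# of first new used]
Here is my beginner Python code: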
import pandas as pd
import numpy as np
# create a DataFrame
df = pd.DataFrame([[12345,"Name1","Name1","Name2","","","A","U","N","N","",""],
                   [45678,"Name3","Name1","Name2","","","S","N","N","N","",""],
                   [45679,"Name2","Name2","Name2","","","S","U","U","U","",""],
                   [98765,"Name3","Name2","Name2","","","","CPO","U","N","",""]],
                  columns=("ID","Car1","Car2","Car3","Car4","Car5","Income",
                           "NU1","NU2","NU3","NU4","NU5") )
nu_array = df[["NU1","NU2","NU3","NU4","NU5"]]
nameplate_array = df[["Car1","Car2","Car3","Car4","Car5"]]
# find the name of the first new-vehicle column and store it in a new column
# (the value in the new column is the name of the column holding the first "N")
df['first_nu_col_name'] = df.eq("N").T.idxmax()
# ONE PROBLEM HERE - this returns a column name even for row 3 (index 2), where all vehicles are "U" (used)
df['col_num'] = df['first_nu_col_name'].str[2:]
df['first_nu_veh_col_name'] = "Car" + df['col_num']
# My code returns the correct column name, except where there is no new vehicle; then it returns an incorrect name.
# I tried iterating (i.e. a loop like the one in the SAS pseudocode) and failed. From what I'm reading, looping is not 'pythonic' and vectorizing is much better; however,
# I haven't found anything by googling that worked, and nothing (that I could understand) about a vectorized solution.
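What I need to do:
For each row of the DataFrame, I'm trying to use the column index or column name of interest to look up the value I want in the other columns, and the result is an error message. Using a function like index(), or trying to use first_nu_veh_col_name as a column name, does not work.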
Use idxmax(), then iterate over the resulting Series to check whether the value at the maximum is actually "N".
max_nu = (df[['NU1', 'NU2', 'NU3', 'NU4', 'NU5']] == 'N').idxmax(axis = 'columns')
df['first_nu_veh_col_name'] = ''
for index, col in max_nu.items():
    if df.loc[index, col] == 'N':
        df.loc[index, 'first_nu_veh_col_name'] = 'Car' + col[2:]
First we find the index of the first new vehicle, returning None if there is none:
df['first_new_index'] = nu_array.apply(lambda row: next((i+1 for i, val in enumerate(row)
                                                          if val == 'N'),
                                                         None),
                                       axis=1)
Then we find the nameplate of the first new vehicle, returning None if there is no value:
df['first_new_nameplate'] = df.apply(lambda row: row[f'Car{int(row["first_new_index"])}']
                                                 if pd.notnull(row["first_new_index"])
                                                 else None,
                                     axis=1)
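The two steps above can also be folded into a single pass over paired column lists, which is close in spirit to the SAS arrays in the question. This is only a sketch: the helper name first_new_value and the lists nu_cols and car_cols are made up for illustration, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[12345, "Name1", "Name1", "Name2", "", "", "A", "U", "N", "N", "", ""],
     [45678, "Name3", "Name1", "Name2", "", "", "S", "N", "N", "N", "", ""],
     [45679, "Name2", "Name2", "Name2", "", "", "S", "U", "U", "U", "", ""],
     [98765, "Name3", "Name2", "Name2", "", "", "", "CPO", "U", "N", "", ""]],
    columns=["ID", "Car1", "Car2", "Car3", "Car4", "Car5", "Income",
             "NU1", "NU2", "NU3", "NU4", "NU5"],
)

# Hypothetical names: two parallel "arrays" of column labels, in the same order
nu_cols = ["NU1", "NU2", "NU3", "NU4", "NU5"]
car_cols = ["Car1", "Car2", "Car3", "Car4", "Car5"]


def first_new_value(row, status_cols, value_cols):
    """Return the value paired with the first "N" status, or None if there is none."""
    for status_col, value_col in zip(status_cols, value_cols):
        if row[status_col] == "N":
            return row[value_col]
    return None


# args passes the two column lists to the helper after the row itself
df["first_new_nameplate"] = df.apply(first_new_value, axis=1, args=(nu_cols, car_cols))
print(df[["ID", "first_new_nameplate"]])
```

Because the helper takes the value columns as an argument, the same call could in principle be pointed at another parallel group of columns (purchase dates, for example), assuming such columns exist and are listed in the same order.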
When a row contains no True, idxmax returns the index of the first False. Combine idxmax with where and any to get NaN when a row contains no True:
m = df.eq('N')
df['first_nu_col_name'] = m.idxmax(axis=1).where(m.any(axis=1))
df['col_num'] = df['first_nu_col_name'].str[2:]
df['first_nu_veh_col_name'] = "Car" + df['col_num']
Output:
      ID   Car1   Car2   Car3  Car4  Car5  Income  NU1  NU2  NU3  NU4  NU5  first_nu_col_name  col_num  first_nu_veh_col_name
0  12345  Name1  Name1  Name2                   A    U    N    N                          NU2        2                   Car2
1  45678  Name3  Name1  Name2                   S    N    N    N                          NU1        1                   Car1
2  45679  Name2  Name2  Name2                   S    U    U    U                          NaN      NaN                    NaN
3  98765  Name3  Name2  Name2                      CPO    U    N                          NU3        3                   Car3
Intermediates:
# m
      ID   Car1   Car2   Car3   Car4   Car5  Income    NU1    NU2    NU3  \
0  False  False  False  False  False  False   False  False   True   True
1  False  False  False  False  False  False   False   True   True   True
2  False  False  False  False  False  False   False  False  False  False
3  False  False  False  False  False  False   False  False  False   True
     NU4    NU5  first_nu_col_name  col_num  first_nu_veh_col_name
0  False  False              False    False                  False
1  False  False              False    False                  False
2  False  False              False    False                  False
3  False  False              False    False                  False
# m.idxmax(axis=1)
0 NU2
1 NU1
2 ID
3 NU3
dtype: object
# m.any(axis=1)
0 True
1 True
2 False
3 True
dtype: bool
# m.idxmax(axis=1).where(m.any(axis=1))
0 NU2
1 NU1
2 NaN
3 NU3
dtype: object
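To go one step further and fetch the nameplate value itself rather than just the column name, one option is positional indexing with NumPy. The sketch below is an assumption-laden variant, not part of the answer above: it restricts the mask to the NU columns (so idxmax/argmax can never land on ID or a Car column), and the names nu_cols, car_cols, pos and picked are made up for illustration. It relies on NU1..NU5 and Car1..Car5 being in the same order.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[12345, "Name1", "Name1", "Name2", "", "", "A", "U", "N", "N", "", ""],
     [45678, "Name3", "Name1", "Name2", "", "", "S", "N", "N", "N", "", ""],
     [45679, "Name2", "Name2", "Name2", "", "", "S", "U", "U", "U", "", ""],
     [98765, "Name3", "Name2", "Name2", "", "", "", "CPO", "U", "N", "", ""]],
    columns=["ID", "Car1", "Car2", "Car3", "Car4", "Car5", "Income",
             "NU1", "NU2", "NU3", "NU4", "NU5"],
)

# Hypothetical helper lists: status columns and nameplate columns, same order
nu_cols = [f"NU{i}" for i in range(1, 6)]
car_cols = [f"Car{i}" for i in range(1, 6)]

m = df[nu_cols].eq("N")            # mask built from the status columns only
has_new = m.any(axis=1)            # False for rows with no "N" at all
pos = m.to_numpy().argmax(axis=1)  # position of first True (0 if the row has none)

# Pick one Car value per row by (row number, position), then blank out rows without "N"
picked = df[car_cols].to_numpy()[np.arange(len(df)), pos]
df["first_new_nameplate"] = pd.Series(picked, index=df.index).where(has_new)
print(df[["ID", "first_new_nameplate"]])
```

As with idxmax, argmax returns position 0 for a row with no True, which is why the where(has_new) step is still needed.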