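I am trying to write a function that creates a new column in a pandas DataFrame: it should work out which substring from a list occurs in a string column and use that substring as the value of the new column.
The problem is that the text to find does not appear at the same position in the variable x.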
import pandas as pd

df = pd.DataFrame({
    'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    'x1': [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]
I want the new column to say which of the finds occurs in the given df["x"] row.
I have written a function that works, but it is very slow on large datasets:
import re

def pd_create_substring_var(df, new_var_name="new_var", substring_list=["1"], var_ori="x"):
    # Default value for rows in which no substring is found
    df[new_var_name] = "na"
    cols = list(df.columns)
    for ix in range(len(df)):
        for find in substring_list:
            # Each entry of substring_list is used as a regular expression
            for m in re.finditer(find, df.iloc[ix][var_ori]):
                df.iat[ix, cols.index(new_var_name)] = df.iloc[ix][var_ori][m.start():m.end()]
    return df
df = pd_create_substring_var(df,"t",finds,var_ori="x")
df
                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0
Probably not the best way:
df['t'] = df['x'].apply(lambda x: ''.join([i for i in finds if i in x]))
And now:
print(df)
Is:
                            x  x1       t
0      var_m500_0_somevartext   4  m500_0
1     var_m500_0_vartextagain   5  m500_0
2  varwithsomeothertext_0_500   6   0_500
3   varwithsomext_m150_0_text   8  m150_0
Now, just to add to @pythonjokeun's answer, you can do:
df["t"] = df["x"].str.extract("(%s)" % '|'.join(finds))
Or:
df["t"] = df["x"].str.extract("({})".format('|'.join(finds)))
Or:
df["t"] = df["x"].str.extract("(" + '|'.join(finds) + ")")
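A note on the str.extract variants above: each entry of finds is interpreted as a regular expression, so an entry containing characters such as . or + would not be matched literally. Below is a minimal sketch, assuming literal matching is what is wanted. Wrapping the entries in re.escape and passing expand=False are my additions, not part of the answer.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# Escape each entry so regex metacharacters are matched literally,
# then join the entries into one alternation inside a capture group.
pattern = "(" + "|".join(re.escape(f) for f in finds) + ")"

# expand=False makes str.extract return a Series instead of a one-column DataFrame.
df["t"] = df["x"].str.extract(pattern, expand=False)
print(df)
```

Rows in which none of the entries occur get NaN in the new column.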
Does this do what you need?
finds = ["m500_0", "0_500", "m150_0"]
df["t"] = df["x"].str.extract(f"({'|'.join(finds)})")
I don't know how large your dataset is, but you can use the map function like this:
import pandas

def subset_df_test():
    df = pandas.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                                 "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                           'x1': [4, 5, 6, 8]})
    finds = ["m500_0", "0_500", "m150_0"]
    df['t'] = df['x'].map(lambda x: compare(x, finds))
    print(df)

def compare(x, finds):
    # Return the first entry of finds contained in x (None if there is none)
    for f in finds:
        if f in x:
            return f
df['x'].str.findall("|".join(finds))
0 [m500_0]
1 [m500_0]
2 [0_500]
3 [m150_0]
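Note that str.findall returns a list per row rather than a string. Here is a sketch of one way to turn that into a plain column, assuming only the first match per row is needed. The .str[0] step is my assumption, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
    "x1": [4, 5, 6, 8],
})
finds = ["m500_0", "0_500", "m150_0"]

# One list of matches per row
matches = df["x"].str.findall("|".join(finds))

# .str[0] picks the first element of each list
df["t"] = matches.str[0]
print(df)
```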
Try this:
df["t"] = df["x"].apply(lambda x: [i for i in finds if i in x][0])
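Be aware that the [0] index above raises an IndexError for any row that contains none of the finds. A hedged sketch of a variant that falls back to a default instead is shown below. first_match is a hypothetical helper name, the extra row "no_match_here" is made up for illustration, and the "na" default simply mirrors the one used in the question.

```python
import pandas as pd

def first_match(text, candidates, default=None):
    # Return the first candidate contained in text, or default if there is none.
    return next((c for c in candidates if c in text), default)

df = pd.DataFrame({
    # The last row is a made-up example with no matching substring.
    "x": ["var_m500_0_somevartext", "var_m500_0_vartextagain",
          "varwithsomeothertext_0_500", "varwithsomext_m150_0_text",
          "no_match_here"],
    "x1": [4, 5, 6, 8, 9],
})
finds = ["m500_0", "0_500", "m150_0"]

df["t"] = df["x"].apply(lambda s: first_match(s, finds, default="na"))
print(df)
```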