I need to run a regression for each group and store the resulting coefficient in a new column, "b". Here is my code:
Custom function:
def simplereg(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        b = model.params[x]
        return pd.Series([b*100]*len(g))
    except Exception as e:
        return pd.Series([np.nan]*len(g))
Create sample data:
import pandas as pd
import numpy as np
# Setting the parameters
gvkeys = ['A', 'B', 'C', 'D'] # Possible values for gvkey
years = np.arange(2000, 2020) # Possible values for year
# Number of rows for each gvkey, ensuring 5-7 observations for each
num_rows_per_gvkey = np.random.randint(5, 8, size=len(gvkeys))
total_rows = sum(num_rows_per_gvkey)
# Creating the DataFrame
np.random.seed(0) # For reproducibility
df = pd.DataFrame({
    'gvkey': np.repeat(gvkeys, num_rows_per_gvkey),
    'year': np.random.choice(years, size=total_rows),
    'y': np.random.rand(total_rows),
    'x': np.random.rand(total_rows)
})
# Sort by year so the gvkeys are interleaved, to check that the code also handles data not sorted by group
df.sort_values(by='year', ignore_index=True, inplace=True)
Run the groupby code:
df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x')
However, the code returns column "b" filled entirely with NaN. Where is the problem, and how can I fix it?
Thank you.
Apparently so. Also, you do not show any error message. If your function returns NaN values, it is probably because your code raised an exception.
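An exception is not the only way to end up with NaN, though. pandas aligns on index labels when a Series is assigned to a DataFrame column, so values whose labels match no row are discarded and those rows are filled with NaN. A minimal sketch of that behaviour, using a made-up frame (df_demo and its column names are illustrative and not part of the question's data):

```python
import pandas as pd

# Hypothetical three-row frame with the default labels 0, 1, 2.
df_demo = pd.DataFrame({"v": [10, 20, 30]}, index=[0, 1, 2])

# Same number of values, but labelled 5, 6, 7: no label matches df_demo's index.
mismatched = pd.Series([1.0, 2.0, 3.0], index=[5, 6, 7])
df_demo["bad"] = mismatched   # every row becomes NaN

# Reusing df_demo's own index makes the assignment line up row by row.
aligned = pd.Series([1.0, 2.0, 3.0], index=df_demo.index)
df_demo["good"] = aligned

print(df_demo)
```

This is why it can matter which index a function passed to groupby().apply() puts on the Series it returns.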
If I import statsmodels.api as sm and make a few small changes, your code works fine for me:
import statsmodels.api as sm

def simplereg(g, y, x):
    try:
        xvar = sm.add_constant(g[x])
        yvar = g[y]
        model = sm.OLS(yvar, xvar, missing='drop').fit()
        b = model.params[x]
        # Reuse the group's own index so the result lines up with df
        return pd.Series([b*100]*len(g), index=g.index)
    except Exception as e:
        print(e)  # At the very least, print the exception
        return pd.Series([np.nan]*len(g), index=g.index)

# Drop the first index level (gvkey) so the result's index matches df's
df['b'] = df.groupby('gvkey').apply(simplereg, y='y', x='x').droplevel('gvkey')
Output:
>>> df
    gvkey  year         y         x           b
0       A  2000  0.799159  0.128926  -55.326856
1       B  2001  0.774234  0.253292  -68.351309
2       A  2003  0.461479  0.315428  -55.326856
3       A  2003  0.780529  0.363711  -55.326856
4       B  2004  0.521848  0.208877  -68.351309
5       C  2005  0.612096  0.656330    6.342994
6       D  2005  0.060225  0.096098   36.320231
7       B  2006  0.414662  0.161310  -68.351309
8       C  2006  0.456150  0.466311    6.342994
9       A  2007  0.118274  0.570197  -55.326856
10      C  2007  0.568434  0.244426    6.342994
11      C  2008  0.943748  0.196582    6.342994
12      D  2009  0.681820  0.368725   36.320231
13      B  2009  0.639921  0.438602  -68.351309
14      A  2012  0.870012  0.670638  -55.326856
15      B  2012  0.264556  0.653108  -68.351309
16      C  2013  0.616934  0.138183    6.342994
17      C  2014  0.018790  0.158970    6.342994
18      A  2015  0.978618  0.210383  -55.326856
19      D  2015  0.666767  0.976459   36.320231
20      D  2016  0.437032  0.097101   36.320231
21      C  2017  0.617635  0.110375    6.342994
22      B  2018  0.944669  0.102045  -68.351309
23      B  2019  0.143353  0.988374  -68.351309
24      D  2019  0.359508  0.820993   36.320231
25      D  2019  0.697631  0.837945   36.320231
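Since every row in a group receives the same coefficient, another option is to compute one scalar per group and broadcast it back by key, which avoids index alignment altogether. The sketch below is only an illustration under some assumptions: it uses np.polyfit with degree 1 in place of sm.OLS so that it runs without statsmodels (for a single regressor with an intercept, both give the ordinary least-squares slope), it does not drop missing values the way missing='drop' does, and slope_pct and demo are hypothetical names with made-up data.

```python
import numpy as np
import pandas as pd

def slope_pct(g, y, x):
    """Least-squares slope of y on x for one group, multiplied by 100."""
    slope, _intercept = np.polyfit(g[x], g[y], 1)
    return slope * 100

# Made-up data with interleaved groups: A follows y = 1 + 2x, B follows y = 5 - x.
demo = pd.DataFrame({
    "gvkey": ["A", "B", "A", "B", "A", "B"],
    "x": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    "y": [1.0, 5.0, 3.0, 4.0, 5.0, 3.0],
})

# One scalar per group, keyed by gvkey.
slopes = {key: slope_pct(g, "y", "x") for key, g in demo.groupby("gvkey")}

# Look each row's gvkey up in the dict; row order and index do not matter.
demo["b"] = demo["gvkey"].map(slopes)

print(demo)
```

Because the lookup is by group key rather than by row label, it works regardless of how the frame is sorted or indexed.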