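I'm way out of my league on this one, so I'm hoping someone can point me in the right direction. I think this is an optimization problem, but scipy.optimize, and how it fits with pulp, has me confused. Matrix math also boggles my mind, so this problem has really been slowing me down.
Problem statement: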
I have a dataset of customers. For each customer, I can pick one of 3 options, or pick none of them, so there are 4 choices in total. Also, for each customer, I have a numerical score that says how "good" each choice is. You can think of this value as the probability that the choice creates a future sale.

    import pandas as pd

    # fake data for the internet
    data = {'customerid': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
            'prob_CHOICEA': [0.00317, 0.00629, 0.00242, 0.00253, 0.00421, 0.00414, 0.00739, 0.00549, 0.00658, 0.00852],
            'prob_CHOICEB': [0.061, 0.087, 0.055, 0.027, 0.022, 0.094, 0.099, 0.072, 0.018, 0.052],
            'prob_CHOICEC': [0.024, 0.013, 0.091, 0.047, 0.071, 0.077, 0.067, 0.046, 0.077, 0.044]
            }

    # Creates pandas DataFrame
    df = pd.DataFrame(data)
    df = df.reset_index(drop=True).set_index(['customerid'])
    +------------+--------------+--------------+--------------+
    | customerid | prob_CHOICEA | prob_CHOICEB | prob_CHOICEC |
    +------------+--------------+--------------+--------------+
    | 101        | 0.00317      | 0.061        | 0.024        |
    | 102        | 0.00629      | 0.087        | 0.013        |
    | 103        | 0.00242      | 0.055        | 0.091        |
    | 104        | 0.00253      | 0.027        | 0.047        |
    | 105        | 0.00421      | 0.022        | 0.071        |
    | 106        | 0.00414      | 0.094        | 0.077        |
    | 107        | 0.00739      | 0.099        | 0.067        |
    | 108        | 0.00549      | 0.072        | 0.046        |
    | 109        | 0.00658      | 0.018        | 0.077        |
    | 110        | 0.00852      | 0.052        | 0.044        |
    +------------+--------------+--------------+--------------+
I started by combining these elements into a single array for each customer:

    # combine all values into 1 array
    list_to_combine = ['prob_CHOICEA', 'prob_CHOICEB', 'prob_CHOICEC']
    df['probs_A_B_C'] = df[list_to_combine].values.tolist()
    df.drop(list_to_combine, axis=1, inplace=True)
    +------------+-------------------------+
    | customerid | probs_A_B_C             |
    +------------+-------------------------+
    | 101        | [0.00317, 0.061, 0.024] |
    | 102        | [0.00629, 0.087, 0.013] |
    | 103        | [0.00242, 0.055, 0.091] |
    | 104        | [0.00253, 0.027, 0.047] |
    | 105        | [0.00421, 0.022, 0.071] |
    | 106        | [0.00414, 0.094, 0.077] |
    | 107        | [0.00739, 0.099, 0.067] |
    | 108        | [0.00549, 0.072, 0.046] |
    | 109        | [0.00658, 0.018, 0.077] |
    | 110        | [0.00852, 0.052, 0.044] |
    +------------+-------------------------+
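For each customer, I only have 4 choices of what to do:

    choices = [
        [0, 0, 0],
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]
    ]

For each customer, I want to pick the best choice. At first glance this is easy: just pick the highest number. However, it starts to blow my mind once I start adding constraints.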
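To make the unconstrained case concrete, here is a minimal sketch of "just pick the highest number". It rebuilds the example scores in a DataFrame called `probs` (a name introduced here for illustration, not part of the original post) and one-hot encodes the best option per customer. Note that without constraints every customer gets some option, so the all-zero choice is never selected.

```python
import numpy as np
import pandas as pd

# Example scores from the question; `probs` is an illustrative name.
probs = pd.DataFrame(
    {
        "prob_CHOICEA": [0.00317, 0.00629, 0.00242, 0.00253, 0.00421,
                         0.00414, 0.00739, 0.00549, 0.00658, 0.00852],
        "prob_CHOICEB": [0.061, 0.087, 0.055, 0.027, 0.022,
                         0.094, 0.099, 0.072, 0.018, 0.052],
        "prob_CHOICEC": [0.024, 0.013, 0.091, 0.047, 0.071,
                         0.077, 0.067, 0.046, 0.077, 0.044],
    },
    index=pd.Index(range(101, 111), name="customerid"),
)

# Column position (0 = A, 1 = B, 2 = C) of the highest score per customer.
best_col = probs.to_numpy().argmax(axis=1)

# One-hot encode that position, e.g. position 1 becomes [0, 1, 0].
best_choice = np.eye(probs.shape[1], dtype=int)[best_col]

print(pd.DataFrame(best_choice, index=probs.index, columns=["A", "B", "C"]))
```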
For example, what if I want to pick the best choice for each customer, but with the constraint that the total number of selected choices equals 5?

    +------------+-------------------------+-------------+
    | customerid | probs_A_B_C             | best_choice |
    +------------+-------------------------+-------------+
    | 101        | [0.00317, 0.061, 0.024] | [0,0,0]     |
    | 102        | [0.00629, 0.087, 0.013] | [0,1,0]     |
    | 103        | [0.00242, 0.055, 0.091] | [0,0,1]     |
    | 104        | [0.00253, 0.027, 0.047] | [0,0,0]     |
    | 105        | [0.00421, 0.022, 0.071] | [0,0,0]     |
    | 106        | [0.00414, 0.094, 0.077] | [0,1,0]     |
    | 107        | [0.00739, 0.099, 0.067] | [0,1,0]     |
    | 108        | [0.00549, 0.072, 0.046] | [0,0,0]     |
    | 109        | [0.00658, 0.018, 0.077] | [0,0,1]     |
    | 110        | [0.00852, 0.052, 0.044] | [0,0,0]     |
    +------------+-------------------------+-------------+

I have not even figured out how to do this in code; I just eyeballed it manually for illustrative purposes.
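For this one constraint alone (exactly 5 selections in total, at most one per customer), a greedy selection should be enough, because all scores are positive: take each customer's best option, then keep the 5 customers whose best score is highest. The sketch below assumes exactly that single constraint; the names `scores` and `TOTAL` are introduced here for illustration.

```python
import numpy as np

# Example scores from the question, one row per customer (101..110),
# columns in the order A, B, C.
scores = np.array([
    [0.00317, 0.061, 0.024],
    [0.00629, 0.087, 0.013],
    [0.00242, 0.055, 0.091],
    [0.00253, 0.027, 0.047],
    [0.00421, 0.022, 0.071],
    [0.00414, 0.094, 0.077],
    [0.00739, 0.099, 0.067],
    [0.00549, 0.072, 0.046],
    [0.00658, 0.018, 0.077],
    [0.00852, 0.052, 0.044],
])

TOTAL = 5  # required number of selections, taken from the example

best_col = scores.argmax(axis=1)  # best option per customer
best_val = scores.max(axis=1)     # score of that option

# Row positions of the TOTAL customers with the highest best score.
keep = np.argsort(best_val)[::-1][:TOTAL]

# Everyone starts at [0, 0, 0]; only the kept customers get an option.
best_choice = np.zeros_like(scores, dtype=int)
best_choice[keep, best_col[keep]] = 1

print(best_choice)
```

Once per-option limits are added, this greedy shortcut is no longer guaranteed to be optimal, which is where an LP or integer-programming formulation becomes useful.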
Ideally, I would like to add multiple constraints simultaneously.
Any ideas on where to start?
You can use scipy.optimize.linprog to solve this linear optimization problem. As outlined in the docs, it requires the boundary conditions to be set up as matrix products. There are two types of boundary conditions: inequalities of the form A @ x <= b and equalities of the form A @ x == b. The problem can be modeled as follows:

The vector x has length N*C, where N is the number of customers and C is the number of options; it represents the choices per customer in a linear layout, customer by customer, which is why it can be reshaped into the N-by-C Choices matrix shown in the output.

The input data, the linprog result, and the resulting choices:

                prob_CHOICEA  prob_CHOICEB  prob_CHOICEC
    customerid
    101              0.00317         0.061         0.024
    102              0.00629         0.087         0.013
    103              0.00242         0.055         0.091
    104              0.00253         0.027         0.047
    105              0.00421         0.022         0.071
    106              0.00414         0.094         0.077
    107              0.00739         0.099         0.067
    108              0.00549         0.072         0.046
    109              0.00658         0.018         0.077
    110              0.00852         0.052         0.044

         con: array([-1.30002675e-11])
         fun: -0.3812999999903971
     message: 'Optimization terminated successfully.'
         nit: 7
       slack: array([1.00000000e+00, 7.99305067e-11, 1.47325485e-11, 1.00000000e+00,
           1.00000000e+00, 2.49527066e-11, 2.42738052e-11, 5.84235438e-10,
           4.23596713e-11, 5.77714543e-11, 8.80984175e-12, 1.46305190e-11])
      status: 0
     success: True
           x: array([2.89971936e-10, 1.32732722e-11, 6.97732845e-12, 1.00000000e+00,
           3.28055311e-10, 5.72702383e-12, 1.80418885e-11, 4.61391860e-12,
           1.00000000e+00, 2.01674011e-10, 4.58311340e-12, 1.29599793e-11,
           2.95298295e-10, 4.34109315e-12, 1.21776975e-11, 3.39951283e-11,
           1.00000000e+00, 2.55262044e-10, 4.94703751e-11, 1.00000000e+00,
           1.57932544e-11, 9.99999999e-01, 2.21487598e-11, 1.33679145e-11,
           2.30514296e-10, 3.91129933e-12, 1.00000000e+00, 1.00000000e+00,
           8.19015577e-12, 1.07293976e-11])

    Choices:
    [[0 0 0]
     [1 0 0]
     [0 0 1]
     [0 0 0]
     [0 0 0]
     [0 1 0]
     [0 1 0]
     [1 0 0]
     [0 0 1]
     [1 0 0]]
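The code that produced this result is not shown here, so the following is only a sketch of how such a model could be set up with scipy.optimize.linprog, not the answer's actual code. The per-option caps in `max_per_option` (at most 3 A, 2 B and 2 C) are an assumption chosen for illustration because they match the counts in the Choices matrix; the sketch also uses inequalities only, whereas the printed result includes one equality constraint (`con`). The `method="highs"` argument assumes a reasonably recent SciPy.

```python
import numpy as np
from scipy.optimize import linprog

# Example scores from the question, one row per customer (101..110),
# columns in the order A, B, C.
scores = np.array([
    [0.00317, 0.061, 0.024],
    [0.00629, 0.087, 0.013],
    [0.00242, 0.055, 0.091],
    [0.00253, 0.027, 0.047],
    [0.00421, 0.022, 0.071],
    [0.00414, 0.094, 0.077],
    [0.00739, 0.099, 0.067],
    [0.00549, 0.072, 0.046],
    [0.00658, 0.018, 0.077],
    [0.00852, 0.052, 0.044],
])
N, C = scores.shape

# ASSUMPTION: illustrative caps on how often A, B, C may be selected.
max_per_option = np.array([3, 2, 2])

# linprog minimizes, so negate the scores to maximize their sum.
# Flattened layout: [c1_A, c1_B, c1_C, c2_A, c2_B, c2_C, ...]
c = -scores.ravel()

# One row per customer: the customer's C entries sum to at most 1.
A_customer = np.kron(np.eye(N), np.ones((1, C)))
# One row per option: its N entries sum to at most the cap.
A_option = np.kron(np.ones((1, N)), np.eye(C))

A_ub = np.vstack([A_customer, A_option])
b_ub = np.concatenate([np.ones(N), max_per_option])

# Every variable is relaxed to the interval [0, 1].
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1), method="highs")

# Round away solver noise and restore the customer-by-option shape.
choices = np.round(res.x).astype(int).reshape(N, C)
print(res.fun)
print(choices)
```

The variables are continuous between 0 and 1 rather than true binaries. For this particular constraint structure (one row per customer, one row per option) the solver lands on 0/1 values anyway, as the printed x above also does up to rounding noise; with more general constraints an integer-programming solver would be the safer choice.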
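This problem can be solved using linear programming (LP), but the hardest part is not knowing that you should use LP; it is transforming your problem into an LP-optimization problem, and I will show you how to do just that. Before proceeding, I will change the example data you provided to keep things simple (the full data generates a huge number of variables). Assume we have the following input data:

Assuming the size of the input problem is N, where N represents the number of choices: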