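I have a pandas dataframe that looks like
import pandas as pd
data = {
    "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
    "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
    "theta": [8,9,2,12,4,5,30,3,2,1,50]
}
df = pd.DataFrame(data)
and I would like to create a new column df['feature'] as follows: for each Race_ID, suppose the Student_ID equals i. The feature is then the sum, over all admissible choices of k, j and l, of terms of the form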
def f(thetak, thetaj, thetai, *theta):
    prod = 1
    for t in theta:
        prod = prod * t
    return ((thetai + thetaj) / (thetai + thetaj + thetai * thetak)) * prod
where k, j, l are Student_IDs in the same Race_ID such that k ≠ i, j ≠ i, k and l ≠ k, j, i, and theta_i is the theta of the student whose Student_ID equals i. For example, for Student_ID = 1, Race_ID = 2, the feature equals (writing each Student_ID in place of that student's theta)
f(2,3,1,4,5)+f(2,3,1,5,4)+f(2,4,1,3,5)+f(2,4,1,5,3)+f(2,5,1,3,4)+f(2,5,1,4,3)
+f(3,2,1,4,5)+f(3,2,1,5,4)+f(3,4,1,2,5)+f(3,4,1,5,2)+f(3,5,1,2,4)+f(3,5,1,4,2)
+f(4,2,1,3,5)+f(4,2,1,5,3)+f(4,3,1,2,5)+f(4,3,1,5,2)+f(4,5,1,2,3)+f(4,5,1,3,2)
+f(5,2,1,3,4)+f(5,2,1,4,3)+f(5,3,1,2,4)+f(5,3,1,4,2)+f(5,4,1,2,3)+f(5,4,1,3,2)
which equals 299.1960138012742.
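As a sanity check on the worked example, the 24-term sum can be evaluated by brute force. The following is only a minimal sketch: it assumes the f(...) arguments above are Student_IDs standing in for the corresponding theta values, and the variable names (theta_i, others, brute_force) are made up for illustration.

```python
from itertools import permutations
from math import prod

def f(thetak, thetaj, thetai, *theta):
    # One term of the sum: a ratio in thetai, thetaj, thetak times the
    # product of all remaining thetas.
    return (thetai + thetaj) / (thetai + thetaj + thetai * thetak) * prod(theta)

# Race 2: student 1 has theta 8; students 2, 3, 4, 5 have thetas 9, 2, 12, 4.
theta_i = 8
others = [9, 2, 12, 4]

# Every ordering of the other students gives one term: the first two entries
# play the roles of k and j, the rest are passed as *theta.
brute_force = sum(f(p[0], p[1], theta_i, *p[2:]) for p in permutations(others))
print(brute_force)  # approximately 299.196, the value quoted above
```

This enumerates all (n-1)! orderings, so it is only practical for very small races.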
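But one quickly realises that the number of terms in the sum grows super-exponentially with the number of students in a race: if a race has n students, the sum has (n-1)! terms.
Fortunately, thanks to the symmetry of f, we can reduce the number of terms to just (n-1)(n-2) by noting the following:
Let i, j, k be given, and let 1, 2, 3 (for the sake of example) be distinct from i, j, k (i.e. 1, 2, 3 are in *theta). Then f(k,j,i,1,2,3) = f(k,j,i,1,3,2) = f(k,j,i,2,1,3) = f(k,j,i,2,3,1) = f(k,j,i,3,1,2) = f(k,j,i,3,2,1), because the trailing arguments only enter through their product. Hence we can reduce the number of terms by computing any one of these and multiplying it by (n-3)!.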
So, for example, for Student_ID = 9, Race_ID = 5, there are already 5! = 120 terms in the sum, but using the symmetry above we only need to sum 5 x 4 = 20 terms (5 choices for k, 4 choices for j, and 1 (non-unique) choice for the l's), namely
f(2,3,9,5,6,10)+f(2,5,9,3,6,10)+f(2,6,9,3,5,10)+f(2,10,9,3,5,6)
+f(3,2,9,5,6,10)+f(3,5,9,2,6,10)+f(3,6,9,2,5,10)+f(3,10,9,2,5,6)
+f(5,2,9,3,6,10)+f(5,3,9,2,6,10)+f(5,6,9,2,3,10)+f(5,10,9,2,3,6)
+f(6,2,9,3,5,10)+f(6,3,9,2,5,10)+f(6,5,9,2,3,10)+f(6,10,9,2,3,5)
+f(10,2,9,3,5,6)+f(10,3,9,2,5,6)+f(10,5,9,2,3,6)+f(10,6,9,2,3,5)
The feature for student 9 in race 5 is then the above sum multiplied by 3! = 6, which gives 53588.197759.
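The reduced sum can be sketched as follows. This is an illustrative helper, not part of the original post: the name reduced_feature and its arguments are assumptions, and it works directly on theta values rather than on Student_IDs.

```python
from itertools import permutations
from math import factorial, prod

def reduced_feature(theta_i, others):
    # others: thetas of the remaining students in the race (hypothetical input).
    n = len(others) + 1  # number of students in the race
    total = 0.0
    # One term per ordered pair (k, j) of distinct other students.
    for k, j in permutations(range(len(others)), 2):
        rest = [t for idx, t in enumerate(others) if idx not in (k, j)]
        ratio = (theta_i + others[j]) / (theta_i + others[j] + theta_i * others[k])
        total += ratio * prod(rest)
    # Each term represents (n-3)! orderings of the remaining thetas.
    return factorial(n - 3) * total

# Race 5: student 9 has theta 5; students 10, 2, 3, 6, 5 have thetas 30, 3, 2, 1, 50.
print(reduced_feature(5, [30, 3, 2, 1, 50]))  # approximately 53588.197759
```

Only (n-1)(n-2) terms are evaluated here, compared with (n-1)! in the full sum.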
So my question is: how do I compute this sum for the dataframe above? I have computed the features by hand as a check, and the desired outcome looks like:
import pandas as pd
data = {
    "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
    "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
    "theta": [8,9,2,12,4,5,30,3,2,1,50],
    "feature": [299.1960138012742, 268.93506341257876, 634.7909309816431, 204.18901708653254, 483.7234700875771, 53588.197759, 9395.539167178009, 78005.26224935807, 92907.8753942894, 118315.38359654899, 5600.243276203378]
}
df = pd.DataFrame(data)
Thank you so much.
import pandas as pd
from itertools import permutations
from math import factorial

def f(thetak, thetaj, thetai, *theta):
    # One term of the sum: the ratio in thetai, thetaj, thetak times the
    # product of the remaining thetas.
    prod = 1
    for t in theta:
        prod = prod * t
    return ((thetai + thetaj) / (thetai + thetaj + thetai * thetak)) * prod

def calculate_feature(df):
    # Results are appended race by race, so this assumes the rows of each
    # race are contiguous in df and every race has at least 3 students.
    features = []
    for race_id in df['Race_ID'].unique():
        race_df = df[df['Race_ID'] == race_id]
        thetas = race_df['theta'].tolist()
        n = len(thetas)
        for i in range(n):
            thetai = thetas[i]
            feature = 0
            # Ordered pairs (j, k) of distinct positions, both different from i
            for j, k in permutations(range(n), 2):
                if j != i and k != i:
                    other_thetas = [thetas[l] for l in range(n) if l not in (i, j, k)]
                    feature += f(thetas[k], thetas[j], thetai, *other_thetas)
            # Each term stands for (n-3)! orderings of the remaining thetas
            features.append(feature * factorial(n - 3))
    return features

data = {
    "Race_ID": [2,2,2,2,2,5,5,5,5,5,5],
    "Student_ID": [1,2,3,4,5,9,10,2,3,6,5],
    "theta": [8,9,2,12,4,5,30,3,2,1,50]
}
df = pd.DataFrame(data)
df['feature'] = calculate_feature(df)
print(df)
This code defines two functions:
- f(thetak, thetaj, thetai, *theta): This function calculates an individual term of your sum, as defined in your question.
- calculate_feature(df): This function iterates through the dataframe and calculates the feature for each Student_ID within each Race_ID.
  - For each Race_ID, it iterates through each student and calculates the sum of f over all ordered pairs of distinct j and k (both different from i), as you described in your optimization.
  - It then multiplies the sum by (n-3)! to account for the symmetry of f.
The final result is stored in a new column called 'feature' in the original dataframe.
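If the races get larger, rebuilding the product of the remaining thetas for every pair becomes the dominant cost. One possible variation, sketched below, replaces it with the product of all other thetas divided by theta_k * theta_j and evaluates the pair sum with NumPy broadcasting. This is a swapped-in technique, not the code above: it assumes no theta is zero and that every race has at least 3 students, and the helper name race_features is hypothetical.

```python
from math import factorial

import numpy as np
import pandas as pd

def race_features(group):
    # group: the theta values of one race (a pandas Series).
    theta = group.to_numpy(dtype=float)
    n = len(theta)
    out = np.empty(n)
    for i in range(n):
        ti = theta[i]
        others = np.delete(theta, i)
        tk = others[:, None]  # rows index k
        tj = others[None, :]  # columns index j
        # Product of the remaining thetas = others.prod() / (tk * tj),
        # which is only valid when no theta is zero.
        terms = (ti + tj) / ((ti + tj + ti * tk) * tk * tj)
        np.fill_diagonal(terms, 0.0)  # exclude k == j
        out[i] = factorial(n - 3) * others.prod() * terms.sum()
    return pd.Series(out, index=group.index)

df = pd.DataFrame({
    "Race_ID": [2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 5],
    "Student_ID": [1, 2, 3, 4, 5, 9, 10, 2, 3, 6, 5],
    "theta": [8, 9, 2, 12, 4, 5, 30, 3, 2, 1, 50],
})
df["feature"] = df.groupby("Race_ID")["theta"].transform(race_features)
print(df)
```

Because the result is returned as a Series carrying the group's index, the values line up with the original rows even if the races are not stored contiguously.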