Finding the Euclidean distance from multiple mean vectors

Question   Votes: 1   Answers: 2

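This is what I am trying to do. I can get through steps 1 to 4 and need help with step 5.

Basically, for each data point I want to find the Euclidean distance to every mean vector, where the mean vectors are computed per value of column y.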

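  1. Take the data
  2. Separate out the non-numeric columns
  3. Find the mean vectors by the y column
  4. Save the means
  5. Subtract each mean vector from each row, based on the y value
  6. Square each column
  7. Sum all the columns
  8. Join the result back to the numeric dataset, then join the non-numeric columns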
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
# dtype=float can fail on the string column 'Name' in newer pandas,
# so cast only the numeric columns
df = pd.DataFrame(data,columns=['Name','Age','weight','class'])
df = df.astype({'Age': float, 'weight': float, 'class': float})
print(df)
df_numeric=df.select_dtypes(include='number')
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean()

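For each row of means, subtract that row from every row of df_numeric. Then square each column of the result and sum all the columns for each row. Finally, join that data back to df_numeric and df_non_numeric.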
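
The operation described above can be sketched in vectorized form. This is only a minimal sketch, not the asker's code: it assumes the feature columns are `Age` and `weight`, and the names `feature_cols`, `diffs`, `sq_dist` and the `sq_dist_to_class_*` columns are made up for illustration. It relies on NumPy broadcasting to subtract every mean vector from every row in one step.

```python
import numpy as np
import pandas as pd

data = [['Alex', 10, 5, 0], ['Bob', 12, 4, 1], ['Clarke', 13, 6, 0], ['brke', 15, 1, 0]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'weight', 'class'])

feature_cols = ['Age', 'weight']  # assumed feature columns
means = df.groupby('class')[feature_cols].mean()  # one mean vector per class

# (n_rows, 1, n_features) minus (1, n_classes, n_features)
# broadcasts to (n_rows, n_classes, n_features).
features = df[feature_cols].to_numpy(dtype=float)
diffs = features[:, None, :] - means.to_numpy()[None, :, :]

# Square each difference and sum over the feature axis.
sq_dist = (diffs ** 2).sum(axis=2)

sq_dist_df = pd.DataFrame(
    sq_dist,
    index=df.index,  # reuse the original index so the join lines up by label
    columns=[f'sq_dist_to_class_{c}' for c in means.index],
)
result = pd.concat([df, sq_dist_df], axis=1)
print(result)
```

These are squared distances, matching steps 5 to 7; taking `np.sqrt(sq_dist)` would give the Euclidean distances themselves.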

-------------- UPDATE1

I have added the code below. My question has changed; the latest question is at the end.

import numpy as np

def calculate_distance(row):
    # Squared Euclidean distance from the row to the first class mean
    return np.sum(np.square(row - means.iloc[0]))

def calculate_distance2(row):
    # Squared Euclidean distance from the row to the last class mean
    return np.sum(np.square(row - means.iloc[-1]))


df_numeric2 = df_numeric.drop(columns="class")
features = df_numeric[means.columns]  # Age and weight only, without class
df_numeric2['distance0'] = features.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = features.apply(calculate_distance2, axis=1)

print(df_numeric2)

final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]

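Can anyone confirm that this is a correct way to get the result? I am mainly concerned about the last two statements. Will the second-to-last statement do a correct join? Will the last statement assign the original class? I want to confirm that Python does not perform the concat and the class assignment in some random order, and that the rows keep the order in which they appear.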

final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"]=df["class"]
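
On the ordering concern: as far as the pandas documentation describes it, both `pd.concat(..., axis=1)` and assigning a Series to a column match rows by index label rather than by position. A small sketch with made-up frames (`left`, `right` and `labels` are hypothetical names) that deliberately shuffles the order to show the matching:

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Alex', 'Bob', 'Clarke']}, index=[0, 1, 2])
# Same labels, deliberately stored in a different order.
right = pd.DataFrame({'distance0': [30.0, 10.0, 20.0]}, index=[2, 0, 1])

# Rows are matched by index label, so Alex (label 0) gets 10.0.
joined = pd.concat([left, right], axis=1)

# Assigning a Series to a column also aligns on the index.
labels = pd.Series(['c', 'a', 'b'], index=[2, 0, 1])
joined['class'] = labels
print(joined)
```

Since `df_non_numeric`, `df_numeric2` and `df` all carry the index of the same original frame, the two statements should line up row for row, provided none of them has been re-indexed or filtered in between.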
python pandas euclidean-distance
2 Answers
2
votes

I think this is what you want:

import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
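# dtype=float can fail on the string column 'Name' in newer pandas,
# so cast only the numeric columns
df = pd.DataFrame(data,columns=['Name','Age','weight','class'])
df = df.astype({'Age': float, 'weight': float, 'class': float})
print(df)
df_numeric=df.select_dtypes(include='number')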
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()

# Subtract mean (calculated using the transform function which preserves the 
# number of rows) for each class  to create distance to mean
df_dist_to_mean =  df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)

This could be written more compactly, but this way you can see what is going on.
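
Note that `np.hypot` takes exactly two inputs, so the code above is tied to two feature columns. A hedged sketch of a variant for any number of numeric columns, assuming every numeric column other than `class` is a feature (`feature_cols` and `diff` are illustrative names, not part of the answer):

```python
import numpy as np
import pandas as pd

data = [['Alex', 10, 5, 0], ['Bob', 12, 4, 1], ['Clarke', 13, 6, 0], ['brke', 15, 1, 0]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'weight', 'class'])

# Assumption: all numeric columns except 'class' are features.
feature_cols = list(df.select_dtypes(include='number').columns.drop('class'))

# transform('mean') returns one row per original row, holding the mean
# vector of that row's class, so the subtraction lines up by index.
diff = df[feature_cols] - df.groupby('class')[feature_cols].transform('mean')

# Euclidean norm of each row's difference vector.
df['euc_dist'] = np.sqrt((diff ** 2).sum(axis=1))
print(df)
```

This gives each row's distance to the mean of its own class only; distances to the other class means would need a separate step.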


1
vote

I am sure there is a better way to do this, but I iterated over the classes and followed your exact steps.

  1. Set 'class' as the index.
  2. Transposed the data so that 'class' is in the columns.
  3. Subtracted the corresponding mean from the matching df_numeric values.
  4. Squared the values.
  5. Summed the rows.
  6. Joined the dataframes back together.

import numpy as np
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'])
#print (df)
df_numeric=df.select_dtypes(include='number').astype(float)
df_non_numeric=df.select_dtypes(exclude='number')
means=df_numeric.groupby('class').mean().T  # columns are the class labels

# Changed index
df_numeric = df_numeric.set_index('class')
# Rotated the numeric data sideways so the class is in the columns
df_numeric = df_numeric.T

# Iterate through the classes in means and pick the matching df_numeric columns
store = []  # Assigned an empty list
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series):  # A class with a single row comes out as a pd.Series
        sto = sto.to_frame()  # Need to convert to dataframe type
    # Subtract that class's mean vector (not the label j) and square the values
    store.append(sto.sub(means[j], axis=0) ** 2)

# Summing over Age and weight gives one squared distance per data point
summed = [i.sum(axis=0) for i in store]
# Note: the result is grouped by class, not in the original row order
df_new = pd.concat(summed)
df_new