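I am developing a Python program that analyzes social media data from a CSV file. The program should compute the correlation coefficient between two variables in the data, but I have a problem: no matter what the input data is, the correlation coefficient always comes back as 0. We are not allowed to use imports or print statements.
I have included the relevant code below: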
def platform_with_highest_users(data):
    # If data is empty
    if not data:
        return "Error: Empty data"
    # Initialize an empty dictionary to store platform counts
    platform_counts = {}
    # Count users for each platform
    for row in data:
        platform = row[4]
        platform_counts[platform] = platform_counts.get(platform, 0) + 1
    # Find the platform with the highest number of users
    highest_users = max(platform_counts.values())
    highest_platforms = [platform for platform, count in platform_counts.items() if count == highest_users]
    # If there's only one platform with the highest number of users, use it directly
    if len(highest_platforms) == 1:
        chosen_platform = highest_platforms[0]
    else:
        # Sort the highest platforms alphabetically and pick the first one
        chosen_platform = sorted(highest_platforms)[0]
    # Filter the data for the chosen platform
    filtered_data = [row for row in data if row[4] == chosen_platform]
    # Return filtered data for the chosen platform
    return filtered_data
def correlation_coefficient(filtered_data):
    if not filtered_data:
        return 0
    x = [int(row[1]) for row in filtered_data]
    y = [int(row[9]) for row in filtered_data]
    # Calculate the correlation value following the given formula
    avg_x = (sum(x) / len(x))
    avg_y = (sum(y) / len(y))
    numerator = sum((x[i] - avg_x) * (y[i] - avg_y) for i in range(len(filtered_data)))
    denominator_x = (sum((x[i] - avg_x)) ** 2 for i in range(len(filtered_data)))
    denominator_y = (sum((y[i] - avg_y)) ** 2 for i in range(len(filtered_data)))
    # Calculate the denominator correctly
    denominator = (denominator_x * denominator_y) ** 0.5
    # Avoid division by zero
    correlation = numerator / denominator if denominator != 0 else 0
    return round(correlation, 4)
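I know the correlation value should not be zero, because we were given this sample output as a reference.
OP1, OP2, OP3, OP4 = main('SocialMedia.csv', [18,25], 'Australia')
The returned output variables are:
OP1 [['11', 4708.0], ['126', 5785.0], ['184', 9266.0]]
OP2 ['Australia', 'Bangladesh', 'Ireland', 'New Zealand', 'Pakistan', 'Yemen']
OP3 [3.5556, 112446.1548, 'rural']
OP4 0.4756
Correlation value: the numeric correlation between age and income for the users of the social media platform with the highest number of users. If several platforms share the same highest user count, sort them alphabetically and use the first one to find the correlation.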
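In your code, denominator_x and denominator_y are generators, because the call to sum() is closed before the `** 2 for ...` part instead of wrapping the whole expression. You cannot multiply one generator by another, so your code will raise a TypeError.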
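To illustrate, the smallest change that should make the original function work is to let sum() wrap the entire generator expression. The following is only a sketch under the question's own assumptions (column 1 holds the age and column 9 the income, both parseable as integers); everything else is kept as in the question.

```python
def correlation_coefficient(filtered_data):
    if not filtered_data:
        return 0
    # Assumption taken from the question: column 1 is age, column 9 is income.
    x = [int(row[1]) for row in filtered_data]
    y = [int(row[9]) for row in filtered_data]
    avg_x = sum(x) / len(x)
    avg_y = sum(y) / len(y)
    numerator = sum((xi - avg_x) * (yi - avg_y) for xi, yi in zip(x, y))
    # sum() now encloses the whole generator expression, so both names
    # are plain numbers rather than generator objects.
    denominator_x = sum((xi - avg_x) ** 2 for xi in x)
    denominator_y = sum((yi - avg_y) ** 2 for yi in y)
    denominator = (denominator_x * denominator_y) ** 0.5
    # Avoid division by zero when either column is constant
    correlation = numerator / denominator if denominator != 0 else 0
    return round(correlation, 4)
```

With this change the multiplication on the `denominator` line operates on two floats, so no TypeError is raised.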
A more robust implementation is:
def correlation_coefficient(x: list[float], y: list[float]) -> float:
    if (_n := len(x)) and _n == len(y):
        mx = sum(x) / _n
        my = sum(y) / _n
        n = da = db = 0.0
        for _x, _y in zip(x, y):
            dx = _x - mx
            dy = _y - my
            n += dx * dy
            da += dx * dx
            db += dy * dy
        try:
            return n / (da ** 0.5 * db ** 0.5)
        except ZeroDivisionError:
            pass
    return 0.0
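This version takes two numeric lists rather than the CSV rows, so the rows returned by platform_with_highest_users would need to be split into columns first. One possible sketch follows; `columns_from_rows` is a hypothetical helper name, the column indices 1 and 9 are taken from the question's code, and the sample rows are invented purely to show the shapes involved.

```python
def columns_from_rows(rows, x_col=1, y_col=9):
    """Split CSV rows into two lists of floats (hypothetical helper)."""
    x = [float(row[x_col]) for row in rows]
    y = [float(row[y_col]) for row in rows]
    return x, y


# Two made-up rows with ten columns each (not real data from SocialMedia.csv).
sample_rows = [
    ["1", "21", "", "", "PlatformA", "", "", "", "", "5000"],
    ["2", "24", "", "", "PlatformA", "", "", "", "", "6500"],
]
ages, incomes = columns_from_rows(sample_rows)
# ages    -> [21.0, 24.0]
# incomes -> [5000.0, 6500.0]
# The two lists can then be passed on, for example:
#   round(correlation_coefficient(ages, incomes), 4)
```

The function above does not round its result, so wrapping the call in round(..., 4) is probably needed to match the four-decimal sample output.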
Note:
If importing other modules were allowed, scipy.stats.pearsonr would be a better choice.