Problem calculating the correlation coefficient in Python

Question (votes: 0, answers: 1)

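I am working on a Python program that analyzes social media data from a CSV file. The program is supposed to compute the correlation coefficient between two variables in the data, but no matter what the input is, the correlation coefficient always comes back as "0". We are not allowed to import any modules or to print output.

I have included the relevant code snippets below: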

def platform_with_highest_users(data):
    # If data is empty
    if not data:
        return "Error: Empty data"
    
    # Initialize an empty dictionary to store platform counts
    platform_counts = {}
    
    # Count users for each platform
    for row in data:
        platform = row[4]
        platform_counts[platform] = platform_counts.get(platform, 0) + 1
    
    # Find the platform with the highest number of users
    highest_users = max(platform_counts.values())
    highest_platforms = [platform for platform, count in platform_counts.items() if count == highest_users]
    
    # If there's only one platform with the highest number of users, directly return its filtered data
    if len(highest_platforms) == 1:
        chosen_platform = highest_platforms[0]
    else:
        # Sort the highest platforms alphabetically and pick the first one
        chosen_platform = sorted(highest_platforms)[0]
    
    # Filter the data for the chosen platform
    filtered_data = [row for row in data if row[4] == chosen_platform]
    
    # Return filtered data for the chosen platform
    return filtered_data

    
def correlation_coefficient(filtered_data):
    if not filtered_data:
        return 0  

    x = [int(row[1]) for row in filtered_data]
    y = [int(row[9]) for row in filtered_data]

    # Calculate the correlation value following the given formula
    avg_x = (sum(x) / len(x))
    avg_y = (sum(y) / len(y))

    numerator = sum((x[i] - avg_x) * (y[i] - avg_y) for i in range(len(filtered_data)))
    denominator_x = (sum((x[i] - avg_x)) ** 2 for i in range(len(filtered_data)))
    denominator_y = (sum((y[i] - avg_y)) ** 2 for i in range(len(filtered_data)))
    
    # Calculate the denominator correctly
    denominator = (denominator_x * denominator_y) ** 0.5
    
    # Avoid division by zero
    correlation = numerator / denominator if denominator != 0 else 0
    
    return round(correlation, 4)

I know the correlation value cannot be zero, because we were given this sample output as a reference.

OP1, OP2, OP3, OP4 = main('SocialMedia.csv', [18,25], 'Australia')
The returned output variables are:
OP1 [['11', 4708.0], ['126', 5785.0], ['184', 9266.0]]
OP2 ['Australia', 'Bangladesh', 'Ireland', 'New Zealand', 'Pakistan', 'Yemen']
OP3 [3.5556, 112446.1548, 'rural']
OP4 0.4756

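Correlation value: the numeric correlation between age and income for the users of the social media platform with the highest number of users. If several platforms share the same highest user count, sort them alphabetically and use the first one to compute the correlation.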

python debugging filtering
1 Answer
0 votes

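In your code, denominator_x and denominator_y are generators, because the outer parentheses wrap a generator expression instead of a call to sum. You cannot multiply one generator by another, so the line that computes denominator will raise a TypeError. The ** 2 is also misplaced: it is applied to the result of sum(...) rather than to each deviation inside the sum.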
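
The following is a minimal sketch of that failure and of one possible fix. The list `x` holds made-up sample values (not data from SocialMedia.csv), and the names `broken`, `error_message` and `sum_sq_x` are chosen here for illustration only.

```python
# Hypothetical sample values, for illustration only.
x = [1, 2, 3, 4]
avg_x = sum(x) / len(x)

# Same shape as the question's code: the outer parentheses create a
# generator object; nothing is summed or squared at this point.
broken = (sum((x[i] - avg_x)) ** 2 for i in range(len(x)))

try:
    broken * broken  # generators do not support the "*" operator
except TypeError as exc:
    error_message = str(exc)

# One possible fix: square each deviation first, then sum the squares.
sum_sq_x = sum((x[i] - avg_x) ** 2 for i in range(len(x)))
```

With the squares summed this way, denominator_x and denominator_y become plain numbers, and the existing line that multiplies them and takes the square root works as intended.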

A more robust implementation is:

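def correlation_coefficient(x: list[float], y: list[float]) -> float:
    # Proceed only if both lists are non-empty and have the same length
    if (_n := len(x)) and _n == len(y):
        mx = sum(x) / _n  # mean of x
        my = sum(y) / _n  # mean of y
        n = da = db = 0.0
        for _x, _y in zip(x, y):
            dx = _x - mx
            dy = _y - my
            n += dx * dy   # sum of products of deviations (numerator)
            da += dx * dx  # sum of squared deviations of x
            db += dy * dy  # sum of squared deviations of y
        try:
            return n / (da ** 0.5 * db ** 0.5)
        except ZeroDivisionError:
            # x or y is constant, so the coefficient is undefined
            pass
    return 0.0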
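
To connect this back to the question, here is a hedged sketch of a function that keeps the question's single-argument form. It assumes, as in the question's code, that column 1 and column 9 of each row hold the two variables, and it rounds to 4 decimal places to match the sample output. The name `age_income_correlation`, the helper `make_row` and the sample rows are hypothetical. Using `float()` instead of `int()` is also an assumption, made because `int()` raises a ValueError on strings such as "4708.0".

```python
def age_income_correlation(filtered_data):
    """Pearson correlation between column 1 and column 9 of each row."""
    # Assumption: the values may be strings such as "4708.0", so use float().
    x = [float(row[1]) for row in filtered_data]
    y = [float(row[9]) for row in filtered_data]
    n = len(x)
    if n == 0:
        return 0.0

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)

    denominator = (var_x * var_y) ** 0.5
    if denominator == 0:
        return 0.0  # one variable is constant; the coefficient is undefined
    return round(cov / denominator, 4)


def make_row(age, income):
    """Hypothetical helper: build a 10-column row with only columns 1 and 9 set."""
    row = [""] * 10
    row[1] = age
    row[9] = income
    return row


# Made-up rows in which the second variable rises linearly with the first.
rows = [make_row("20", "1000"), make_row("30", "2000"), make_row("40", "3000")]
result = age_income_correlation(rows)  # 1.0
```

No imports and no calls to print are needed, which fits the constraints stated in the question.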

Note:

If you are allowed to import other modules, scipy.stats.pearsonr would be a better choice.
