Pandas DataFrame - ambiguous

Question (votes: 0, answers: 1)

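I am trying to use some code that runs a Jaro-Winkler function to compare the similarity of two strings. If I just hard-code two values, john and jon, the logic below works without any problem. What I want, however, is to read a csv file and compare all of the names. When I try that, I get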

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

# Python3 implementation of above approach
from math import floor
import pandas as pd

# Function to calculate the
# Jaro Similarity of two strings
def jaro_distance(s1, s2):
    # If the strings are equal
    if (s1 == s2):
        return 1.0;

    # Length of two strings
    len1 = len(s1);
    len2 = len(s2);

    if (len1 == 0 or len2 == 0):
        return 0.0;

    # Maximum distance up to which matching
    # is allowed
    max_dist = (max(len(s1), len(s2)) // 2) - 1;

    # Count of matches
    match = 0;

    # Hash for matches
    hash_s1 = [0] * len(s1);
    hash_s2 = [0] * len(s2);

    # Traverse through the first string
    for i in range(len1):

        # Check if there are any matches
        for j in range(max(0, i - max_dist),
                       min(len2, i + max_dist + 1)):

            # If there is a match
            if (s1[i] == s2[j] and hash_s2[j] == 0):
                hash_s1[i] = 1;
                hash_s2[j] = 1;
                match += 1;
                break;

    # If there is no match
    if (match == 0):
        return 0.0;

    # Number of transpositions
    t = 0;

    point = 0;

    # Count number of occurrences
    # where two characters match but
    # there is a third matched character
    # in between the indices
    for i in range(len1):
        if (hash_s1[i]):

            # Find the next matched character
            # in second string
            while (hash_s2[point] == 0):
                point += 1;

            if (s1[i] != s2[point]):
                point += 1;
                t += 1;
            else:
                point += 1;

    # Each transposition was counted twice, so halve the total once, after the loop
    t /= 2;

    # Return the Jaro Similarity
    return ((match / len1 + match / len2 +
             (match - t) / match) / 3.0);


# Jaro Winkler Similarity
def jaro_Winkler(s1, s2):
    jaro_dist = jaro_distance(s1, s2);

    # If the jaro Similarity is above a threshold
    if (jaro_dist > 0.7):

        # Find the length of common prefix
        prefix = 0;

        for i in range(min(len(s1), len(s2))):

            # If the characters match
            if (s1[i] == s2[i]):
                prefix += 1;

            # Else break
            else:
                break;

        # Maximum of 4 characters are allowed in prefix
        prefix = min(4, prefix);

        # Calculate jaro winkler Similarity
        jaro_dist += 0.1 * prefix * (1 - jaro_dist);

    return jaro_dist;


# Driver code
if __name__ == "__main__":
    df = pd.read_csv('names.csv')
    # s1 = 'john' -- this works
    # s2 = 'jon' -- this works
    s1 = df['name1']  # this doesn't; the csv has a header row (name1, name2) and a few rows under each
    s2 = df['name2']  # this doesn't

    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));

Traceback (most recent call last):
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 113, in <module>
    print("Jaro-Winkler Similarity =", jaro_Winkler(s1, s2));
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 80, in jaro_Winkler
    jaro_dist = jaro_distance(s1, s2);
  File "C:\Users\john\PycharmProjects\heatMap\Jaro.py", line 9, in jaro_distance
    if (s1 == s2):
  File "C:\Users\john\PycharmProjects\heatMap\venv\lib\site-packages\pandas\core\generic.py", line 1537, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Process finished with exit code 1

Sample from the csv: (image)

python pandas ambiguous jaro-winkler
1 answer
0 votes

The main problem you are facing is not your function but the logic around it.

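Suppose we want to evaluate whether a statement is true or false, for example by comparing two numbers. With single numbers this is easy: we just compare the values (1 == 1, 1 != 2, ...).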

But suppose we want to compare a collection of values with a single value, for example [1, 2, 3, 4] with 1?

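To us this looks simple: we compare each number in turn, so 1 == 1, 1 != 2, and so on. But if we ask whether the list as a whole is equal to 1, we hit a problem: some of its elements are equal and others are not, so there is no single true-or-false answer.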

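This is the main reason you get that error: you are passing a whole Series (a column of the csv) into a function that expects a single string. Inside jaro_distance, s1 == s2 compares the two Series element by element and yields a Series of booleans, which the if statement cannot reduce to one True or False. The error message suggests: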

Use a.empty, a.bool(), a.item(), a.any() or a.all().
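
The behaviour described above can be reproduced in a few lines. This is only a minimal sketch: the Series `values` is made-up illustration data, not the asker's csv.

```python
import pandas as pd

# Made-up illustration data
values = pd.Series([1, 2, 3, 4])

# == on a Series compares element by element and returns a boolean Series
mask = values == 1
print(mask.tolist())  # [True, False, False, False]

# Using that Series where a single True/False is required raises the ValueError
try:
    if mask:
        print("equal")
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```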

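These are all ways of telling pandas how to reduce a Series to a single truth value: a.empty reports whether the Series has no elements, a.bool() and a.item() return the value of a single-element Series (as a boolean and as a plain scalar, respectively), a.any() checks whether at least one value is true, and a.all() checks whether every value is true.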
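
As a rough illustration of what some of these reductions return, here is a short sketch on made-up boolean data. a.bool() is left out because it only applies to a single-element boolean Series, much like a.item() in the last line.

```python
import pandas as pd

# Made-up example data: the kind of result an element-wise comparison gives
mask = pd.Series([True, False, False, False])

print(mask.empty)  # False: the Series has elements
print(mask.any())  # True: at least one element is True
print(mask.all())  # False: not every element is True

# item() only works when the Series holds exactly one element
single = pd.Series([True])
print(single.item())  # True
```

Note that none of these is a drop-in fix for the question's code: they answer "is any/every pair equal?", not "how similar is each pair?".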

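Another option is the .apply() method, as @Nick Odell suggested in a comment on this post. Called on a DataFrame with axis=1, it applies a function to each row, so it should solve the problem: the function receives one pair of names at a time, and each comparison inside it yields a single truth value.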
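
A row-by-row call might look like the sketch below. Two things here are assumptions made to keep the example self-contained: `similarity` is a stand-in built on difflib's SequenceMatcher (in the asker's script it would be jaro_Winkler), and the DataFrame is hypothetical data in place of pd.read_csv('names.csv').

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(s1, s2):
    # Stand-in for the question's jaro_Winkler(s1, s2);
    # any function that takes two plain strings fits here.
    return SequenceMatcher(None, s1, s2).ratio()


# Hypothetical data in place of pd.read_csv('names.csv')
df = pd.DataFrame({
    "name1": ["john", "mary", "peter"],
    "name2": ["jon", "mary", "petra"],
})

# axis=1 hands the lambda one row at a time, so similarity()
# receives two strings rather than two whole columns
df["score"] = df.apply(
    lambda row: similarity(row["name1"], row["name2"]), axis=1
)

# The same result without apply, pairing the columns with zip
scores = [similarity(a, b) for a, b in zip(df["name1"], df["name2"])]

print(df)
```

Either form keeps the comparison function unchanged; only the way it is called changes.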
