Assigning labels: all values are False

Question

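I am having trouble assigning a label depending on whether a condition is met. Specifically, I want to assign False (or 0) to rows that contain at least one of these words

my_list=["maths", "science", "geography", "statistics"]

in one of the following fields:

path | Subject | Notes

and to look for these websites

webs=["www.stanford.edu", "www.ucl.ac.uk", "www.sorbonne-universite.fr"]

in the web column.

To do this, I am using the following code: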

  def part_is_in(x, values):
        output = False
        for val in values:
            if val in str(x):
                return True
                break                
        return output


  def assign_value(filename):
    my_list=["maths", "", "science", "geography", "statistics"]
  

    filename['Label'] = filename[['path','subject','notes']].apply(part_is_in, values= my_list)
    filename['Low_Subject']=filename['Subject']
    filename['Low_Notes']=filename['Notes']
    lower_cols = [col for col in filename if col not in ['Subject','Notes']]
    filename[lower_cols]= filename[lower_cols].apply(lambda x: x.astype(str).str.lower(),axis=1)
    webs=["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]

# NEW COLUMN # this is still inside the function but I cannot add an indent within this post

filename['Label'] = pd.Series(index = filename.index, dtype='object')

for index, row in filename.iterrows():
        value = row['web']

        if any(x in str(value) for x in webs):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False

for index, row in filename.iterrows():
        value = row['Subject']

        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False

for index, row in filename.iterrows():
        value = row['Notes']

        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
            
for index, row in filename.iterrows():
        value = row['path']

        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
            
return(filename)

My dataset is

web                        path         Subject                Notes
www.stanford.edu        /maths/           NA                    NA
www.ucla.com           /history/        History of Egypt        NA
www.kcl.ac.uk         /datascience/     Data Science            50 students
...

The expected output is:

web                        path         Subject                Notes           Label
www.stanford.edu        /maths/           NA                    NA               1    # contains the web and maths
www.ucla.com           /history/        History of Egypt        NA               0    
www.kcl.ac.uk         /datascience/     Data Science            50 students      1    # contains the word science
...

With my code, every value comes out as False. Can you spot the problem?

Answer 1

The final values in `Label` are booleans:
- If you want the `Label` column to hold integers (0 or 1) instead of booleans (True/False), you can convert it with:
```
df['Label'] = df['Label'].astype(int)
```
- This ensures the `Label` column contains 0 and 1.
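As a minimal sketch of that conversion (the two-row `df` below is made-up illustration data, not the asker's dataset):

```python
import pandas as pd

# Hypothetical frame with a boolean Label column, for illustration only
df = pd.DataFrame({"web": ["www.stanford.edu", "www.ucla.com"],
                   "Label": [True, False]})

# astype(int) maps True to 1 and False to 0
df["Label"] = df["Label"].astype(int)
labels = df["Label"].tolist()
print(labels)
```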
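The `test_words` function:
- This function processes the values in the `path`, `Subject`, and `Notes` columns.
- These are the steps it performs:
  - Fills all `NaN` values (which are `float` type) with the empty string `''`, so that every value is `str` type.
  - Converts all words to lowercase.
  - Replaces every `/` character with a space.
  - Splits the resulting strings on spaces to create lists of words.
  - Combines all the lists into a single set.
  - Uses set intersection to determine whether the row contains any word in `my_list`.
    - `set.intersection` checks whether two sets overlap.
    - For example:
      - `{'datascience'}.intersection({'science'})` returns an empty set, because the sets share no element.
      - `{'data', 'science'}.intersection({'science'})` returns `{'science'}`, because that word is in both sets.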
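The steps above can be traced on a single row. This is only a sketch: `row_tokens` is a hypothetical helper name introduced here, and the row mirrors the third row of the sample data.

```python
import numpy as np
import pandas as pd

my_list = {"maths", "science", "geography", "statistics"}

def row_tokens(row: pd.Series) -> set:
    """Hypothetical helper: lowercase, turn '/' into spaces, split into a set of words."""
    cleaned = row.fillna("").str.lower().str.replace("/", " ", regex=False)
    # str.split() with no argument splits on whitespace and drops empty pieces
    return {word for cell in cleaned.str.split() for word in cell}

row = pd.Series({"path": "/datascience/", "Subject": "Data Science", "Notes": np.nan})
tokens = row_tokens(row)
matched = tokens & my_list  # same result as tokens.intersection(my_list)
print(tokens, matched)
```

The path alone yields only the token `datascience`, which is not in `my_list`; the row matches because `Subject` contributes `science` as a separate word.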
`lambda x: any(x in y for y in webs)`:
- This lambda checks whether the value in the `web` column appears in any of the URLs listed in `webs`.
- For each value in `webs`, it checks whether the `web` value is a substring of that URL.
  - For example, `'www.stanford.edu' in 'https://www.stanford.edu'` evaluates to `True`.
- If any URL contains the `web` value, the overall expression evaluates to `True`.
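To make the direction of that substring check concrete, here is a small sketch. `in_webs` is a hypothetical name used only for this illustration; the logic is the same as the lambda.

```python
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk",
        "http://www.sorbonne-universite.fr"]

def in_webs(web: str) -> bool:
    """True if `web` is a substring of at least one URL in `webs`."""
    return any(web in url for url in webs)

results = {w: in_webs(w)
           for w in ["www.stanford.edu", "www.ucla.com", "www.kcl.ac.uk"]}
print(results)
```

Because the `web` value is searched for inside each URL, a very short value such as `'www'` would also count as a match, so this check assumes the column holds full host names.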

import numpy as np
import pandas as pd

def test_words(v: pd.Series) -> bool:
    """
    Checks if any word from my_list is present in the combined values of 'path', 'Subject', and 'Notes'.
    
    Args:
        v (pd.Series): A row of the DataFrame containing 'path', 'Subject', and 'Notes'.
    
    Returns:
        bool: True if any word from my_list is found, otherwise False.
    """
    # Fill NaN values with an empty string, convert to lowercase, replace '/' with ' ', and split on spaces
    v = v.fillna('').str.lower().str.replace('/', ' ').str.split(' ')
    
    # Create a set containing all unique words from the combined columns
    s_set = {st for row in v for st in row if st}
    
    # Check if there is any intersection between s_set and my_list
    return bool(s_set.intersection(my_list))

# Test data and DataFrame
data = {'web': ['www.stanford.edu', 'www.ucla.com', 'www.kcl.ac.uk'],
        'path': ['/maths/', '/history/', '/datascience/'],
        'Subject': [np.nan, 'History of Egypt', 'Data Science'],
        'Notes': [np.nan, np.nan, '50 students']}

df = pd.DataFrame(data)

# Given my_list
my_list = ["maths", "science", "geography", "statistics"]
my_list = set(map(str.lower, my_list))  # Convert to a set and ensure words are lowercase

# Given webs; all values should be lowercase
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]

# Update 'Label' based on conditions
df['Label'] = df[['path', 'Subject', 'Notes']].apply(test_words, axis=1) | df.web.apply(lambda x: any(x in y for y in webs))

print(df)

                web           path           Subject        Notes  Label
0  www.stanford.edu        /maths/               NaN          NaN   True
1      www.ucla.com      /history/  History of Egypt          NaN  False
2     www.kcl.ac.uk  /datascience/      Data Science  50 students   True

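Notes on the original code:

Calling `iterrows` several times is not recommended, especially on large datasets. Consider vectorized operations or other efficient methods instead. In the original, each loop also overwrites the `Label` written by the loop before it, so only the result of the last check survives.
The new function consolidates the logic and makes it more readable.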
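One possible vectorized variant is sketched below. It is not the method from the answer above: it swaps the set intersection for a regular expression with word boundaries, and it assumes the `web` column holds bare host names, so it compares them exactly against `webs` with the scheme stripped.

```python
import re

import numpy as np
import pandas as pd

my_list = ["maths", "science", "geography", "statistics"]
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk",
        "http://www.sorbonne-universite.fr"]

df = pd.DataFrame({
    "web": ["www.stanford.edu", "www.ucla.com", "www.kcl.ac.uk"],
    "path": ["/maths/", "/history/", "/datascience/"],
    "Subject": [np.nan, "History of Egypt", "Data Science"],
    "Notes": [np.nan, np.nan, "50 students"],
})

# Concatenate the three text columns into one lowercase string per row
cols = df[["path", "Subject", "Notes"]].fillna("")
text = (cols["path"] + " " + cols["Subject"] + " " + cols["Notes"]).str.lower()

# \b word boundaries stop 'datascience' from counting as 'science'
pattern = r"\b(?:" + "|".join(map(re.escape, my_list)) + r")\b"
has_word = text.str.contains(pattern)

# Assumption: compare bare host names exactly, after dropping the URL scheme
hosts = [url.split("://", 1)[-1] for url in webs]
has_web = df["web"].isin(hosts)

# Combine both conditions in a single pass, with no row loop
df["Label"] = (has_word | has_web).astype(int)
print(df)
```

Both conditions are combined with `|` in one assignment, so no check can overwrite another.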
