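I am having trouble assigning a label depending on whether certain conditions are met. Specifically, I want to assign True (or 1) to rows that contain at least one of these words
my_list=["maths", "science", "geography", "statistics"]
in one of the following fields:
path | Subject | Notes
and also to rows whose web column contains one of these websites:
webs=["www.stanford.edu", "www.ucl.ac.uk", "www.sorbonne-universite.fr"]
To do this, I am using the following code: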
def part_is_in(x, values):
    output = False
    for val in values:
        if val in str(x):
            return True
    return output
def assign_value(filename):
    my_list=["maths", "", "science", "geography", "statistics"]
    filename['Label'] = filename[['path','subject','notes']].apply(part_is_in, values= my_list)
    filename['Low_Subject']=filename['Subject']
    filename['Low_Notes']=filename['Notes']
    lower_cols = [col for col in filename if col not in ['Subject','Notes']]
    filename[lower_cols]= filename[lower_cols].apply(lambda x: x.astype(str).str.lower(),axis=1)
    webs=["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]
    # NEW COLUMN
    filename['Label'] = pd.Series(index = filename.index, dtype='object')
    for index, row in filename.iterrows():
        value = row['web']
        if any(x in str(value) for x in webs):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    for index, row in filename.iterrows():
        value = row['Subject']
        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    for index, row in filename.iterrows():
        value = row['Notes']
        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    for index, row in filename.iterrows():
        value = row['path']
        if any(x in str(value) for x in my_list):
            filename.at[index,'Label'] = True
        else:
            filename.at[index,'Label'] = False
    return(filename)
My dataset is
web               path           Subject           Notes
www.stanford.edu  /maths/        NA                NA
www.ucla.com      /history/      History of Egypt  NA
www.kcl.ac.uk     /datascience/  Data Science      50 students
...
The expected output is:
web               path           Subject           Notes        Label
www.stanford.edu  /maths/        NA                NA           1      # contains the web and maths
www.ucla.com      /history/      History of Egypt  NA           0
www.kcl.ac.uk     /datascience/  Data Science      50 students  1      # contains the word science
...
With my code, every value comes out as False. Can you spot the problem?
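The final values in Label are Booleans. If you want the Label column to contain integers (0 or 1) instead of Booleans (True/False), you can convert it with:
df['Label'] = df['Label'].astype(int)
After that, the Label column contains 0 and 1.
The def test_words function:
Takes the values in the path, Subject and Notes columns of a row.
Fills all NaN values (float type) with '', so that every value is of str type.
Converts the values to lowercase, replaces the / characters with spaces, and splits on spaces.
Collects the resulting words into a set and checks whether it contains any word in my_list.
set.intersection checks whether two sets overlap. {'datascience'}.intersection({'science'}) returns an empty set because there is no intersection. {'data', 'science'}.intersection({'science'}) returns {'science'} because that word is in both sets.
lambda x: any(x in y for y in webs):
Checks whether the value in the web column occurs in any of the URLs listed in webs. For each value in webs, it checks whether the web value is a substring of that URL.
'www.stanford.edu' in 'https://www.stanford.edu' evaluates to True.
If any value matches, the whole expression evaluates to True.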
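The two checks described above can be tried in isolation. The following is a minimal sketch in plain Python; the helper name tokenize and the result variables are made up for illustration and do not appear in the original code. It also shows the reverse substring test, which is the direction the question's code uses for the web column and which can never succeed while webs holds full URLs.

```python
my_list = {"maths", "science", "geography", "statistics"}
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk",
        "http://www.sorbonne-universite.fr"]


def tokenize(text: str) -> set:
    """Lowercase, treat '/' as a space, and split into a set of words."""
    return {w for w in text.lower().replace("/", " ").split(" ") if w}


# Whole-word matching: 'datascience' is one token, so it does not match
# 'science', while 'Data Science' yields the separate token 'science'.
no_overlap = tokenize("/datascience/") & my_list
overlap = tokenize("Data Science") & my_list

# Substring direction matters: the bare host is inside the full URL,
# but the full URL is not inside the bare host.
host_in_url = "www.stanford.edu" in "https://www.stanford.edu"
url_in_host = "https://www.stanford.edu" in "www.stanford.edu"

# Same test as the lambda in the answer, applied to a single value.
web_match = any("www.stanford.edu" in url for url in webs)

print(no_overlap, overlap, host_in_url, url_in_host, web_match)
```

Because the match is on whole words, a value such as '/datascience/' contributes nothing on its own; the third row of the sample data is labelled True only because its Subject contains the separate word Science.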
import numpy as np
import pandas as pd


def test_words(v: pd.Series) -> bool:
    """
    Checks if any word from my_list is present in the combined values of 'path', 'Subject', and 'Notes'.
    Args:
        v (pd.Series): A row of the DataFrame containing 'path', 'Subject', and 'Notes'.
    Returns:
        bool: True if any word from my_list is found, otherwise False.
    """
    # Fill NaN values with an empty string, convert to lowercase, replace '/' with ' ', and split on spaces
    v = v.fillna('').str.lower().str.replace('/', ' ', regex=False).str.split(' ')
    # Create a set containing all unique words from the combined columns
    s_set = {st for row in v for st in row if st}
    # Check if there is any intersection between s_set and my_list
    return bool(s_set.intersection(my_list))


# Test data and DataFrame
data = {'web': ['www.stanford.edu', 'www.ucla.com', 'www.kcl.ac.uk'],
        'path': ['/maths/', '/history/', '/datascience/'],
        'Subject': [np.nan, 'History of Egypt', 'Data Science'],
        'Notes': [np.nan, np.nan, '50 students']}
df = pd.DataFrame(data)

# Given my_list
my_list = ["maths", "science", "geography", "statistics"]
my_list = set(map(str.lower, my_list))  # Convert to a set and ensure words are lowercase

# Given webs; all values should be lowercase
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk", "http://www.sorbonne-universite.fr"]

# Update 'Label' based on conditions: True if either the word check or the web check succeeds
df['Label'] = df[['path', 'Subject', 'Notes']].apply(test_words, axis=1) | df.web.apply(lambda x: any(x in y for y in webs))
print(df)
                web           path           Subject        Notes  Label
0  www.stanford.edu        /maths/               NaN          NaN   True
1      www.ucla.com      /history/  History of Egypt          NaN  False
2     www.kcl.ac.uk  /datascience/      Data Science  50 students   True
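For comparison with the code in the question: each of its four iterrows loops assigns Label in both the if and the else branch, so every loop overwrites the result of the one before it and only the last loop (over path) decides the final value. The sketch below reproduces that effect on a single made-up row (the row contents and variable names are hypothetical) and contrasts it with combining the checks using or, which is the role the | operator plays in the answer's code.

```python
words = ["maths", "science", "geography", "statistics"]

# Hypothetical row: the Subject matches, the path does not.
row = {"Subject": "science fair", "path": "/history/"}

# Pattern used in the question: every pass assigns the label outright,
# so a match found in an earlier column is lost.
label = None
for col in ["Subject", "path"]:
    label = any(w in row[col] for w in words)
overwritten = label

# Accumulating with `or` keeps a match found in any column.
label = False
for col in ["Subject", "path"]:
    label = label or any(w in row[col] for w in words)
accumulated = label

print(overwritten, accumulated)
```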
Avoid using iterrows, especially on large datasets. Consider vectorized operations or other efficient methods instead.
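One possible vectorized variant is sketched below. It is not the code from the answer: it swaps the set intersection for a regular expression with \b word boundaries (so punctuation also separates words, unlike splitting on spaces and '/'), and it swaps the substring test for an exact host comparison after stripping the http:// or https:// prefix from webs. Both substitutions are assumptions about what the labels should mean, and the names pattern, text_cols, word_hit, hosts and web_hit are illustrative.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "web": ["www.stanford.edu", "www.ucla.com", "www.kcl.ac.uk"],
    "path": ["/maths/", "/history/", "/datascience/"],
    "Subject": [np.nan, "History of Egypt", "Data Science"],
    "Notes": [np.nan, np.nan, "50 students"],
})
my_list = ["maths", "science", "geography", "statistics"]
webs = ["https://www.stanford.edu", "https://www.ucl.ac.uk",
        "http://www.sorbonne-universite.fr"]

# Whole-word pattern: \b keeps 'datascience' from matching 'science'.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in my_list) + r")\b"

# Run the vectorized string test once per column, then OR across columns.
# na=False makes missing values count as "no match".
text_cols = ["path", "Subject", "Notes"]
word_hit = df[text_cols].apply(
    lambda col: col.str.contains(pattern, case=False, regex=True, na=False)
).any(axis=1)

# Assumption: a row matches only if its host equals a listed host exactly.
hosts = {re.sub(r"^https?://", "", url.lower()) for url in webs}
web_hit = df["web"].str.lower().isin(hosts)

# Combine both conditions and store the label as 0/1.
df["Label"] = (word_hit | web_hit).astype(int)
print(df)
```

On the three sample rows this gives the same labels as the expected output in the question (1, 0, 1), with no explicit Python loop over rows.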