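I have an Excel file (input.xlsx) with two columns (id and url).
I scraped all the URLs and ran text analysis on the scraped text.
I have functions that compute the positive score, negative score, polarity, and so on.
I want to create an output file (output.xlsx) containing all of these results, but my script writes the same value into every row, even though it prints the correct value inside the function.
Example:
Columns: id, url, positive score, negative score, polarity, etc.
Rows: each row contains the output of each function.
Expected output: positive score (column): 23, 70, 43, 35 (rows)
Actual output: positive score (column): 35, 35, 35, 35 (rows)
My functions: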
# CALCULATING POSITIVE SCORES
# Cleaned texts
os.getcwd()
new_texts_folder = os.path.join(os.getcwd(), 'new_texts')
for root, folders, files in os.walk(new_texts_folder):
    for file in files:
        path = os.path.join(root, file)
        with codecs.open(path, encoding='utf-8', errors='ignore') as info:
            new_content = eval(info.read())  # Convert string to list

def positive_score(content):
    # tokens = tokenz(text)
    pos_score = 0
    for token in content:
        if token in filtered_positive_dictionary:
            pos_score += 1
    return pos_score

# positive_result = positive_score(new_content)
The code above prints the correct output only when you print inside the function. Outside the function it prints a single output.
My Excel code:
data_collection = {
    'URL_ID': url_ids,  # (this is working as expected)
    'URL': urls,  # (this is working as expected)
    'POSITIVE SCORE': positive_score(new_content)  # (this is not working as expected)
}
excel_data_df = pd.DataFrame(data_collection)
excel_data_df.to_excel("Output.xlsx", index=False)
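The problem is that the DataFrame is built from a single score: by the time `positive_score(new_content)` runs in `data_collection`, `new_content` holds only the last file's tokens, and pandas repeats that one number in every row. To fix this, call `positive_score` for each file inside the loop, store the results in a list, and use that list when creating the DataFrame.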
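The difference can be reproduced without any files. The following is a minimal sketch, assuming in-memory token lists and a small positive word set (`token_lists` and `positive_words` are hypothetical stand-ins, not your real data). It builds one DataFrame the way the question does and one with a score per item.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned token lists and the dictionary.
positive_words = {"good", "excellent", "happy"}
token_lists = [
    ["good", "bad", "happy"],
    ["excellent", "good", "happy"],
    ["sad", "good"],
]


def positive_score(content):
    """Count the tokens that appear in the positive word set."""
    return sum(1 for token in content if token in positive_words)


# Pattern from the question: after the loop, only the last list is left.
for new_content in token_lists:
    pass
broken = pd.DataFrame({
    "URL_ID": [1, 2, 3],
    "POSITIVE SCORE": positive_score(new_content),  # one int, repeated per row
})

# Pattern from the fix: one score per list, collected in a list.
fixed = pd.DataFrame({
    "URL_ID": [1, 2, 3],
    "POSITIVE SCORE": [positive_score(tokens) for tokens in token_lists],
})

print(broken["POSITIVE SCORE"].tolist())
print(fixed["POSITIVE SCORE"].tolist())
```

When a dict passed to `pd.DataFrame` mixes lists with a plain number, pandas fills the whole column with that number, which is why the question's output shows the last score everywhere.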
Try this:
import os
import codecs
import pandas as pd

# Placeholder dictionary; replace it with your own filtered_positive_dictionary.
filtered_positive_dictionary = {'good': 1, 'excellent': 1, 'happy': 1}

def positive_score(content):
    # Count the tokens that appear in the positive dictionary.
    pos_score = 0
    for token in content:
        if token in filtered_positive_dictionary:
            pos_score += 1
    return pos_score

# One entry per file, so every column has the same length.
url_ids = []
urls = []
positive_scores = []

new_texts_folder = os.path.join(os.getcwd(), 'new_texts')
for root, folders, files in os.walk(new_texts_folder):
    for file in files:
        path = os.path.join(root, file)
        with codecs.open(path, encoding='utf-8', errors='ignore') as info:
            # Convert string to list. eval runs whatever the file contains,
            # so ast.literal_eval is a safer choice for untrusted files.
            new_content = eval(info.read())
        # Score this file while its content is still in new_content.
        pos_score = positive_score(new_content)
        url_ids.append(file)
        urls.append(f"file://{path}")
        positive_scores.append(pos_score)

data_collection = {
    'URL_ID': url_ids,
    'URL': urls,
    'POSITIVE SCORE': positive_scores
}
excel_data_df = pd.DataFrame(data_collection)
excel_data_df.to_excel("Output.xlsx", index=False)
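The loop above keeps your `eval(info.read())` call. If each file really contains just a Python list literal, as the "Convert string to list" comment suggests, `ast.literal_eval` from the standard library should do the same job without executing code. This is a rough sketch under that assumption; `parse_token_list` is a hypothetical helper name, not something from your script.

```python
import ast


def parse_token_list(text):
    """Parse a string such as "['good', 'bad']" into a list of tokens."""
    # literal_eval accepts only Python literals, so no code is executed.
    tokens = ast.literal_eval(text)
    if not isinstance(tokens, list):
        raise ValueError("expected a list literal")
    return tokens


print(parse_token_list("['good', 'bad', 'happy']"))
```

In the loop you would then write `new_content = parse_token_list(info.read())` in place of the `eval` line.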