How can I use a loop to run the same sentiment analysis on 114 articles?


I'm trying to run sentiment analysis on 114 articles and would like to do it compactly with a loop. I need to extract the text from the links stored in an Excel file, run the sentiment analysis on each one, compute a number of variables (they are all in the code attached below), and write the results back into an Excel file. For confidentiality reasons I cannot attach the Excel file with the links, but it is laid out like this:

URL_ID  URL
37      https://
38      https://

And yes, the first row (not counting the header) has URL_ID 37 and the last row has 150.
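
Read with pandas, that file would look roughly like this (the file and column names are the same ones I use in the code below; the real URLs are omitted for confidentiality):

import pandas as pd

# URL_ID runs from 37 to 150, i.e. 150 - 37 + 1 = 114 articles in total,
# so row position 0 corresponds to URL_ID 37 and row 113 to URL_ID 150.
df = pd.read_excel("Links.xlsx")
print(len(df))     # 114
print(df.head(2))
#    URL_ID       URL
# 0      37  https://
# 1      38  https://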

Here is the code:

import re
import nltk
import string
import openpyxl
import requests
import textstat
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize, sent_tokenize

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"}

df=pd.read_excel("Links.xlsx")

id_s = df["URL_ID"].to_numpy()
url_s = df["URL"].to_numpy()

# Download each article and save its title and body text to <URL_ID>.txt
for i,id in enumerate(id_s):
    r = requests.get(url_s[i],headers=headers).text
    soup = BeautifulSoup(r, 'lxml')
    with open(str(id_s[i])+".txt", 'w', encoding="utf-8") as file:
        for tag in soup.find_all('title'):
            file.write(tag.text.strip())
        for tag in soup.find_all("div", {"class": "td-post-content"}):
            file.write(tag.text.strip())

#List of stopwords
with open("Stop_Words.txt", "r") as s_file:
    s_list = s_file.read().lower().split("\n")

a = [i for i in range(114)]  # row positions 0..113 of the output sheet

def work():
    for n in range(1,114):
            #read the file,convert to lowercase, split
            with open(str(id_s[i])+'.txt', 'r', encoding = "utf-8")as file1:
                text = file1.read().lower().split()

            # remove punctuation
            pun = str.maketrans({key: None for key in string.punctuation + '’' + '—' + '“' + '”'})
            no_pun = [t.translate(pun) for t in text]

            # filter stopwords
            for r in no_pun:
                if not r in s_list:
                    with open('filtered.txt','a') as filtered: 
                        filtered.write(" "+r)

            files = open("filtered.txt", 'r').read()

            #Tokenize words
            token = word_tokenize(files)

            #Create the positive and negative words dictionary from Master Dictionary:
            with open("positive-words.txt", "r", encoding="utf-8") as pos_file:
                read=pos_file.read()
                pos_list=read.split("\n")
                
                
            with open("negative-words.txt", "r") as neg_file:
                read=neg_file.read()
                neg_list=read.split("\n")
                

            #Function to categorise positive and negative words from the master words.
            def intersection(x, y):
                lst = [value for value in x if value in y]
                return lst
            p = intersection(pos_list, token)
            n = intersection(neg_list, token)
            pos = len(p)
            neg = len(n)

            #Stop words from NLTK
            stop_nltk = set(stopwords.words('english'))

            #Word count - Cleaned words
            word_count = [w for w in no_pun if not w in stop_nltk]

            #Tokenized_sentences
            strs = open(str(id_s[i])+'.txt', 'r', encoding = "utf-8").read()
            sent_text = nltk.sent_tokenize(strs)

            #Syllables
            vowelreg = re.compile(r'(?!e[ds]\b)[aeiou]', re.I)

            #Personal Pronouns
            regex = re.compile('I|we|We|My|my|ours|us') 
            pron = re.findall(regex, strs)

            #Number of characters
            char = textstat.char_count(strs, ignore_spaces=True)

            #Complex words
            comp = len([t for t in text if len(vowelreg.findall(t)) > 2])

            df2 = pd.read_excel("Sentiment Analysis.xlsx")
            df2.loc[a, 'POSITIVE SCORE'] = pos
            df2.loc[a, 'NEGATIVE SCORE'] = neg
            df2.loc[a, 'POLARITY SCORE'] = (pos-neg)/((pos+neg)+0.000001)
            df2.loc[a, 'SUBJECTIVITY SCORE'] = int(pos+neg)/(len(word_count)+0.000001)
            df2.loc[a, 'AVG SENTENCE LENGTH'] = len(word_count)/len(sent_text)
            df2.loc[a, 'PERCENTAGE OF COMPLEX WORDS'] = comp/len(word_count)
            df2.loc[a, 'FOG INDEX'] = 0.4 * (len(word_count)/len(sent_text) + comp/len(word_count))
            df2.loc[a, 'AVG NUMBER OF WORDS PER SENTENCE'] = len(word_count)/len(sent_text)
            df2.loc[a, 'COMPLEX WORD COUNT'] = comp
            df2.loc[a, 'WORD COUNT'] = len(word_count)
            df2.loc[a, 'SYLLABLE PER WORD'] = len(vowelreg.findall(strs))/len(word_count)
            df2.loc[a, 'PERSONAL PRONOUNS'] = len(pron)
            df2.loc[a, 'AVG WORD LENGTH'] = char/len(word_count)

# THIS IS THE CELL THAT HAS THE KERNEL BUSY FOR ALMOST AN HOUR
work()

I should add that the web-scraping part went smoothly, and I can run the sentiment analysis for each link individually without any problem and then write it into the Excel file. I'm not very good with loops, but I feel a loop should solve this. Please let me know where I went wrong and how to fix it (preferably without changing most of the code).
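
For reference, this is roughly the loop structure I think I'm after, where compute_metrics is only a placeholder standing in for the per-article calculations from work() above (I haven't actually written it as a function):

import pandas as pd

def compute_metrics(text):
    # placeholder: the positive/negative scores, polarity, word counts, etc.
    # from the code above would be computed here from one article's text
    return {"POSITIVE SCORE": 0, "NEGATIVE SCORE": 0}

df = pd.read_excel("Links.xlsx")
df2 = pd.read_excel("Sentiment Analysis.xlsx")

for row, url_id in enumerate(df["URL_ID"]):
    # read the article text saved as <URL_ID>.txt by the scraping loop
    with open(str(url_id) + ".txt", "r", encoding="utf-8") as f:
        text = f.read()
    for col, value in compute_metrics(text).items():
        df2.loc[row, col] = value   # fill one row per article, not all 114 at once

df2.to_excel("Sentiment Analysis.xlsx", index=False)

Is something along these lines the right direction, instead of my range(1,114) loop?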

python loops for-loop kernel sentiment-analysis