I'm new to Python (and to StackOverflow, so apologies if I'm doing this wrong). I scraped submissions from the /r/loseit subreddit so I can clean them and create a word cloud in R for an assignment. The scraping works fine, but special characters are displayed as junk, for example in this sentence:
"ALLLL, I think Iâ€™ve finally broken the plateau, and I..."
My code:
import praw

# Set up the app
reddit = praw.Reddit(client_id='removed',
                     client_secret='removed',
                     user_agent='removed')

# Import pandas library as pd
import pandas as pd

# Make an empty list to collect the posts
posts = []

# Scrape the body of each text post and append it to posts.
# We only want text posts; any other data is not necessary.
li_subreddit = reddit.subreddit('LoseIt')
for post in li_subreddit.new(limit=1000):
    posts.append([post.selftext])
posts = pd.DataFrame(posts, columns=['body'])
posts

# Save as csv
posts.to_csv('loseit2.csv')
Mojibake!!

The most interesting kind of brokenness that ftfy fixes is when someone has encoded Unicode with one standard and decoded it with another. This often shows up as characters that turn into nonsense sequences (known as "mojibake"):
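For instance, this is what that kind of mismatch looks like (the sample sentence below is made up for illustration):

# Encode a curly apostrophe (U+2019) as UTF-8, then wrongly decode
# the bytes as Windows-1252 -- the classic mojibake round trip:
original = "I\u2019ve finally broken the plateau"
garbled = original.encode('utf-8').decode('cp1252')
print(garbled)   # Iâ€™ve finally broken the plateau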
Install "fixes text for you" like this:

pip3 install ftfy

and use it like this:
import ftfy
# this will fix your encoding problem
posts['body'] = posts['body'].map(ftfy.fix_encoding)
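You can check it on a single garbled string first (the sample value is the hypothetical one from above):

import ftfy

garbled = "Iâ€™ve finally broken the plateau"
print(ftfy.fix_encoding(garbled))   # I’ve finally broken the plateau

After mapping fix_encoding over the column, re-run posts.to_csv('loseit2.csv') and the exported text should be clean.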