I scraped data using the Reddit API in Python, but characters like ' show up as â€™. How do I fix this?

Question

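I'm new to Python (and to StackOverflow, so forgive me if I'm doing this wrong). I scraped submissions from the /r/loseit subreddit so that I can clean them up and build a word cloud in R for an assignment. The scraping works fine, but special characters show up as garbage, as in the following sentence:

"ALLLL, I think I've broken through the plateau, and I.."

My code: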

import praw
import pandas as pd

# Set up the Reddit app
reddit = praw.Reddit(client_id='removed',
                     client_secret='removed',
                     user_agent='removed')

# Start with an empty list of posts
posts = []

# Collect the body of each text post; no other data is needed.
li_subreddit = reddit.subreddit('LoseIt')
for post in li_subreddit.new(limit=1000):
    posts.append([post.selftext])
posts = pd.DataFrame(posts, columns=['body'])
posts

# Save as CSV
posts.to_csv('loseit2.csv')

Answer 1

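Mojibake!

The most interesting kind of brokenness that ftfy will fix is when someone has encoded Unicode with one standard and decoded it with a different one. This often shows up as characters that turn into nonsense sequences (called "mojibake"):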

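The word schön might appear as schÃ¶n.
An em dash (U+2014) might appear as â€”.
Text that was meant to be enclosed in quotation marks might end up enclosed in â€œ and â€<9d> instead, where <9d> represents an unprintable character.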
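To make the mechanism concrete, here is a minimal sketch of how such sequences arise, assuming the common case of UTF-8 bytes being decoded as Windows-1252 (cp1252). The helper name garble and the sample strings are made up for illustration; whether this is the exact mismatch in the question's data is an assumption.

```python
def garble(text: str) -> str:
    """Illustrative helper: encode as UTF-8, then wrongly decode as cp1252."""
    return text.encode("utf-8").decode("cp1252")


# "\u2019" is the curly apostrophe that Reddit posts often contain.
examples = ["schön", "I\u2019ve"]
garbled = [garble(s) for s in examples]

for clean, broken in zip(examples, garbled):
    print(f"{clean!r} -> {broken!r}")
```

Each non-ASCII character takes two or three bytes in UTF-8, and cp1252 maps every one of those bytes to its own character, which is why one apostrophe turns into three symbols.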

Install ftfy ("fixes text for you") with pip3 install ftfy, then use it like this:

import ftfy

# this will fix your encoding problem
posts['body'] = posts['body'].map(ftfy.fix_encoding)
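If installing ftfy is not an option, the simplest case can be reversed with the standard library alone. The sketch below assumes exactly one layer of "UTF-8 read as cp1252" damage; undo_mojibake is a hypothetical helper, not part of ftfy or pandas, and it covers far fewer cases than ftfy does.

```python
def undo_mojibake(text: str) -> str:
    """Hypothetical helper: reverse one layer of UTF-8-decoded-as-cp1252.

    Returns the input unchanged when the round trip is not possible,
    which is usually the case for text that was never garbled.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(undo_mojibake("schÃ¶n"))
print(undo_mojibake("Iâ€™ve"))
print(undo_mojibake("schön"))  # already clean, so left as-is
```

It can be applied to the DataFrame the same way as above, for example posts['body'].map(undo_mojibake). Text containing bytes that cp1252 leaves undefined (such as the <9d> case) is returned unrepaired by this sketch.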
