Separating English and non-English sentences from a DataFrame using Python / Pandas / NLTK

Question

I am using the CrisisLexT26 dataset for my research project. The dataframe looks like this:

Tweet Text | Informativeness
local assistance neighbour boulder flood | Related
tourism singapore suffers haze blow | Related
estate chat con hiya wendy queen vive costa | Related

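Column 1 contains the tweet text, and column 2 indicates whether the tweet is related to a natural disaster.

I want to create two dataframes: one containing only the English sentences and the other containing the non-English sentences.

Sample tweets 1 and 2 should end up in the first dataframe, while tweet 3 should go in the other one because it is a non-English sentence.

I tried a language-detection library and various NLTK methods, but could not get it to work. Can someone help me?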

https://github.com/jeyadosstimothy/ML-on-CrisisLex/blob/master/CrisisLexT26/2012_Colorado_wildfires/2012_Colorado_wildfires-tweets_labeled.csv

Answer 1

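from langdetect import detect
# Adds a 'lang' column with a language code such as 'en'; the leading space in ' Tweet Text' presumably matches the CSV header.
tweet_df['lang'] = tweet_df[' Tweet Text'].apply(detect)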

It takes a while to run, but it works.

TextBlob threw a request error when I tried it.
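
The snippet above only labels each row; the question also asks for two separate dataframes. A minimal sketch of that last step follows. The helper names `make_langdetect_detector` and `split_english` are made up for illustration, and the `"unknown"` fallback label is an assumption, not something langdetect returns itself. Detection on short, preprocessed tweets is likely to be noisy, so it may be worth spot-checking the result.

```python
import pandas as pd


def make_langdetect_detector(seed=0):
    """Build a langdetect-backed detector that returns 'unknown' on failure.

    Hypothetical helper; langdetect is imported here so the rest of the
    sketch can be used with any other detector function.
    """
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is non-deterministic unless a seed is set
    DetectorFactory.seed = seed

    def safe_detect(text):
        try:
            return detect(text)
        except LangDetectException:
            # Raised when the text has no usable features (e.g. empty string)
            return "unknown"

    return safe_detect


def split_english(df, text_col, detector):
    """Return (english_df, other_df) based on the detector's language code."""
    # Treat missing values as empty strings so the detector always gets a str
    lang = df[text_col].fillna("").astype(str).map(detector)
    is_english = lang == "en"
    return df[is_english].copy(), df[~is_english].copy()


# Example usage (assumes the CSV linked in the question is in the working directory):
# tweet_df = pd.read_csv("2012_Colorado_wildfires-tweets_labeled.csv")
# english_df, other_df = split_english(
#     tweet_df, " Tweet Text", make_langdetect_detector()
# )
```

Rows that the detector cannot classify end up in the non-English dataframe here; whether that is the right default depends on the project.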
