Changing code to use multiple cores

Question

For a project I am trying to get the sentiment of different news articles, using the sentimentr package. However, since I have quite a number of articles, I am trying to speed this up by using multiple cores of my processor. My current code is the following:

library(dplyr)
library(sentimentr)

# Extract sentences
df_sentences <- text1 %>%
  select(content) %>%
  get_sentences()

# Get sentiment score
df_sentiment <- df_sentences %>%
  sentiment_by()

text1 is a data frame that contains the articles and information about them; the content column holds the actual article text. I found the parallel package online, which should let you do this. I tried to implement it with the code below, but unfortunately it does not seem to use more cores, since the speed stays the same.

library(dplyr)
library(sentimentr)
library(parallel)

# Set up a cluster on all but one core
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(sentimentr))
clusterExport(cl, "text1")

# Extract sentences, then score sentiment, on the cluster
df_sentences2 <- text1 %>% select(content) %>% parLapply(cl, ., get_sentences)
df_sentiment <- df_sentences2 %>% parSapply(cl, ., sentiment_by)

stopCluster(cl)

I hope someone can help me by telling me whether I am doing this correctly, or what I have to change to get it working properly, since it could save me a lot of time. All help is greatly appreciated! Example data is included below:

structure(list(X = 0:4, id = 17284:17288, title = c("Example Title", 
"Example Title", "Example Title", "Example Title", "Example Title"
), publication = c("New York Times", "New York Times", "New York Times", 
"New York Times", "New York Times"), author = c("Example Writer", 
"Example Writer", "Example Writer", "Example Writer", "Example Writer"
), date = c("2016-12-31", "2015-12-31", "2014-12-31", "2013-12-31", 
"2012-12-31"), year = c(2016, 2016, 2016, 2016, 2016), month = c(12, 
12, 12, 12, 12), url = c(NA, NA, NA, NA, NA), content = c("This is an example sentence. This is another example sentence", 
"This is an example sentence. This is another example sentence", 
"This is an example sentence. This is another example sentence", 
"This is an example sentence. This is another example sentence", 
"This is an example sentence. This is another example sentence"
)), .Names = c("X", "id", "title", "publication", "author", "date", 
"year", "month", "url", "content"), class = "data.frame", row.names = c(NA, 
-5L))

Edit:

I have changed my original code to incorporate @F.Privé's comment, shown below, but the time needed to perform the operations stays the same. I hope someone knows what I need to change to get it working properly.

library(dplyr)
library(sentimentr)
library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(sentimentr))
clusterExport(cl, "text1")

# Split each article into sentences, one article per task
df_sentences <- text1 %>% 
  pull(content) %>% 
  parLapply(cl, ., get_sentences)

# Score sentiment per article
df_sentiment <- df_sentences %>% 
  parLapply(cl, ., sentiment_by)

stopCluster(cl)
r performance parallel-processing sentiment-analysis text-analysis
1 Answer

So, the best approach would be to split the vector into ncores parts so that each core does one part of the whole computation.
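
For instance, a minimal sketch of that splitting idea using only the parallel package from the question (this is not the question's code; it assumes text1$content is a plain character vector, and the chunk variable names are purely illustrative):

library(parallel)
library(sentimentr)

ncores <- detectCores() - 1
cl <- makeCluster(ncores)
clusterEvalQ(cl, library(sentimentr))

# Split the article texts into ncores roughly equal chunks
idx_chunks <- splitIndices(nrow(text1), ncores)
text_chunks <- lapply(idx_chunks, function(i) text1$content[i])

# Each worker sentence-splits and scores its own chunk of articles
res <- parLapply(cl, text_chunks, function(txt) {
  sentiment_by(get_sentences(txt))
})
stopCluster(cl)

# Stack the per-chunk results (element_id restarts within each chunk)
df_sentiment <- do.call(rbind, res)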

In one of my packages, I have a function that does this using foreach:

# devtools::install_github("privefl/bigstatsr")
library(bigstatsr)

# Split the row indices across the workers; each worker runs get_sentences()
# on its part of the text, and the parts are combined with c()
res <- big_parallelize(text1[["content"]], p.FUN = function(x, ind) {
  sentimentr::get_sentences(x[ind])
}, p.combine = 'c', ind = rows_along(text1), ncores = nb_cores())

# Reattach the get_sentences classes to the combined result
structure(res, class = c("get_sentences", "get_sentences_character", "list"))
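
From there, sentiment can be scored on the recombined sentences object in the usual sentimentr two-step way; a small sketch, assuming the structure() result above is stored in a variable (res_sentences is a name introduced here for illustration only):

# Hypothetical name for the recombined get_sentences object from above
res_sentences <- structure(res, class = c("get_sentences", "get_sentences_character", "list"))

# Average sentiment per article, one row per element
df_sentiment <- sentimentr::sentiment_by(res_sentences)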