pyspark RDD word count


I have a dataframe with a text column and a category column. I want to count the words that are common across these categories. I am using nltk to remove stop words and tokenize, but I am not able to carry the category through that process. Below is sample code for my problem.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession,Row
import nltk
spark_conf = SparkConf().setAppName("test")
sc = SparkContext.getOrCreate(spark_conf)
spark = SparkSession(sc)  # session is needed below to create the DataFrame

def wordTokenize(x):
    words = [word for line in x for word in line.split()]
    return words

def rmstop(x):
    from nltk.corpus import stopwords
    stop_words=set(stopwords.words('english'))
    word = [w for w in x if w not in stop_words]
    return word

# in actual problem I have a file which I am reading as a dataframe
# so creating a dataframe first 

df = [('Happy','I am so happy today'),
     ('Happy', 'its my birthday'),
     ('Happy', 'lets have fun'),
    ('Sad', 'I am going to die today'),
    ('Neutral','I am going to office today'),('Neutral','This is my house')]
rdd = sc.parallelize(df)
rdd_data = rdd.map(lambda x: Row(Category=x[0], text=x[1]))
df_data = spark.createDataFrame(rdd_data)


#convert to rdd for nltk process
df_data_rdd = df_data.select('text').rdd.flatMap(lambda x: x)

#make it lower and sentence tokenize
df_data_rdd1 = df_data_rdd.map(lambda x : x.lower())\
.map(lambda x: nltk.sent_tokenize(x))

#word tokenize
data_rdd1_words   = df_data_rdd1.map(wordTokenize)

#stop word and distinct
data_rdd1_words_clean = data_rdd1_words.map(rmstop)\
.flatMap(lambda x: x)\
.distinct()

data_rdd1_words_clean.collect()

Output: ['today', 'birthday', 'let', 'die', 'house', 'happy', 'fun', 'going', 'office']

I want to count the frequency of each word (after preprocessing) across the categories. For example "today": 3, because it occurs in all three categories.

apache-spark pyspark nltk rdd
1 Answer

Here, extractphraseRDD is an RDD containing your phrases. The code below counts the word frequencies and displays them in descending order.

freqDistRDD = (extractphraseRDD
    .flatMap(lambda x: nltk.FreqDist(x).most_common())  # (word, count) pairs per phrase
    .reduceByKey(lambda x, y: x + y)                     # sum the counts across phrases
    .sortBy(lambda x: x[1], ascending=False))            # most frequent words first
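
The code above gives overall word frequencies, but the question asks in how many categories each word occurs. Here is a minimal sketch (not part of the original answer), assuming the df_data DataFrame and the rmstop helper from the question, that keeps the category paired with each word and counts distinct (word, category) pairs:

# Sketch: count the number of distinct categories in which each cleaned word appears.
# Assumes df_data and rmstop are defined as in the question; tokenization here is a
# plain split rather than nltk.sent_tokenize, which is enough for this example data.
category_word_counts = (
    df_data.rdd
    .map(lambda row: (row.Category, row.text.lower().split()))   # (category, [words])
    .flatMap(lambda cw: [(w, cw[0]) for w in rmstop(cw[1])])     # (word, category) pairs
    .distinct()                                                   # one pair per word per category
    .map(lambda wc: (wc[0], 1))
    .reduceByKey(lambda a, b: a + b)                              # number of categories per word
)

category_word_counts.collect()
# e.g. [('today', 3), ('birthday', 1), ...]

The distinct() call means a word repeated several times within one category still contributes only once for that category.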