How do I access a nested array inside a cell of a Parquet table using Spark and Python?


I want to extract the "text" values from the sentiment column of the table, filtered by city = london.

I have a table that looks like this:

name    city    sentiment
harry   london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='happy'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='sad'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='mad')
                ]"
sally   london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='sad'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='mad'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='agitated')
                ]"
gary    london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='excited'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='down'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='agitated')
                ]"
mary    manchester  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='sad'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='low'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='content')
                ]"
gerry   manchester  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='ecstatic'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='good'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='bad')
                ]"

My code currently looks like this, but it does not work:

from pyspark.sql import functions as F
from pyspark.sql import types as T

data = spark.read.parquet("INSERT S3 TABLE").where("city LIKE 'london' AND sentiment['text=']")
df = data.toPandas()
print(df)

And I would like the output to look like this:

name    city    sentiment
harry   london  happy
harry   london  sad
harry   london  mad
sally   london  sad
sally   london  mad
sally   london  agitated
gary    london  excited
gary    london  down
gary    london  agitated

Does anyone know how I can access the array inside the sentiment column to extract the text?

Thanks in advance.

python sql arrays dataframe parquet
1 Answer

First, create a dataframe with your sample data:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('explode_example').getOrCreate()

data = [
    ("harry", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
    ]),
    ("sally", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("gary", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
    ]),
    ("gerry", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
    ])
]

df = spark.createDataFrame(data=data, schema = ["name", "city", "sentiment"])

What you have is the following dataframe:

df.show(truncate=False)

+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|name |city      |sentiment                                                                                                                                                                                                    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|harry|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]]        |
|sally|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]     |
|gary |london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> down], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]|
|mary |manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> low], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]]      |
|gerry|manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> good], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]]    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
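
Note that because the sample rows above are built from Python dicts, Spark infers sentiment as a map column, which is why the output renders it as key -> value pairs. A Parquet table written from Row objects, as in the question, would more likely hold an array of structs; the explode-and-extract steps below work the same either way. Purely as a sketch (this schema is my assumption, not read from your file), the same sample data could also be loaded with an explicit struct schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Hypothetical explicit schema mirroring the Row(...) layout shown in the question
sentiment_item = StructType([
    StructField("score", StringType()),
    StructField("sentiment", StructType([StructField("score", StringType())])),
    StructField("text", StringType()),
])

struct_schema = StructType([
    StructField("name", StringType()),
    StructField("city", StringType()),
    StructField("sentiment", ArrayType(sentiment_item)),
])

# Reuses the `data` list defined above; each dict is mapped onto the struct fields
df_struct = spark.createDataFrame(data=data, schema=struct_schema)
df_struct.printSchema()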

Once you have the dataframe, you need to explode the sentiment column:

from pyspark.sql.functions import explode

df_exp = df.select(df["name"], df["city"], explode(df["sentiment"]))

Result:

df_exp.show(truncate=False)

+-----+----------+---------------------------------------------------------------------+
|name |city      |col                                                                  |
+-----+----------+---------------------------------------------------------------------+
|harry|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy]   |
|harry|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad]     |
|harry|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]        |
|sally|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|sally|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad]     |
|sally|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|gary |london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited] |
|gary |london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> down]    |
|gary |london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|mary |manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|mary |manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> low]     |
|mary |manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]    |
|gerry|manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic]|
|gerry|manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> good]    |
|gerry|manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]        |
+-----+----------+---------------------------------------------------------------------+

Finally, let's create a column containing just the text and select the three columns we want:

# Extract text
df_exp = df_exp.withColumn("text", df_exp["col"].text)

# Select result columns
result = df_exp.select("name", "city", "text")
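
Equivalently, the value can be pulled out with getItem (or with bracket notation, df_exp["col"]["text"]); the choice is mostly a matter of style:

from pyspark.sql.functions import col

# Same extraction as above, written with getItem on the exploded column
df_exp = df_exp.withColumn("text", col("col").getItem("text"))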

The result will be:

result.show(truncate=False)

+-----+----------+--------+
|name |city      |text    |
+-----+----------+--------+
|harry|london    |happy   |
|harry|london    |sad     |
|harry|london    |mad     |
|sally|london    |sad     |
|sally|london    |mad     |
|sally|london    |agitated|
|gary |london    |excited |
|gary |london    |down    |
|gary |london    |agitated|
|mary |manchester|sad     |
|mary |manchester|low     |
|mary |manchester|content |
|gerry|manchester|ecstatic|
|gerry|manchester|good    |
|gerry|manchester|bad     |
+-----+----------+--------+
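
To also apply the city = london filter from your question and bring the result back into pandas, a where clause can be added before collecting. A minimal sketch against the original parquet source (the path placeholder is kept from your post, and the column names are assumed to match the sample above):

from pyspark.sql.functions import col, explode

# Read the parquet table, keep only the london rows, explode the nested
# sentiment array, and extract the text of each element
result_london = (
    spark.read.parquet("INSERT S3 TABLE")
    .where(col("city") == "london")
    .select("name", "city", explode("sentiment").alias("s"))
    .withColumn("text", col("s")["text"])
    .select("name", "city", "text")
)

print(result_london.toPandas())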