How do I access a nested array inside a cell of a Parquet table using Spark and Python?


I want to extract the "text" values from the sentiment column of the table, filtered by city = london.

I have a table that looks like this:

name    city    sentiment
harry   london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='happy'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='sad'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='mad')
                ]"
sally   london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='sad'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='mad'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='agitated')
                ]"
gary    london  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='excited'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='down'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='agitated')
                ]"
mary    manchester  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='sad'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='low'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='content')
                ]"
gerry   manchester  "[
                  Row(score='0.999926',
                  sentiment=Row(score='-0.640237'),
                  text='ecstatic'),
                  Row(score='0.609836',
                  sentiment=Row(score='-0.607594'),
                  text='good'),
                  Row(score='0.58564',
                  sentiment=Row(score='-0.6833'),
                  text='bad')
                ]"

My code currently looks like this, but it does not work:

from pyspark.sql import functions as F
from pyspark.sql import types as T

data = spark.read.parquet("INSERT S3 TABLE").where("city LIKE 'london' AND sentiment['text=']")
df = data.toPandas()
print(df)

And I would like the output to look like this:

name    city    sentiment
harry   london  happy
harry   london  sad
harry   london  mad
sally   london  sad
sally   london  mad
sally   london  agitated
gary    london  excited
gary    london  down
gary    london  agitated

Does anyone know how I can access the array inside the sentiment column to extract the text?

Thanks in advance.

python sql arrays dataframe parquet
1 Answer

First, create a dataframe with your sample data:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('explode_example').getOrCreate()

data = [
    ("harry", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
    ]),
    ("sally", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("gary", "london", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
    ]),
    ("mary", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
    ]),
    ("gerry", "manchester", [
        {"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
        {"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
        {"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
    ])
]

df = spark.createDataFrame(data=data, schema = ["name", "city", "sentiment"])

What you have is the following dataframe:

df.show(truncate=False)

+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|name |city      |sentiment                                                                                                                                                                                                    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|harry|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]]        |
|sally|london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]     |
|gary |london    |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> down], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]|
|mary |manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> low], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]]      |
|gerry|manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> good], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]]    |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
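
Note that because the sample rows above are built from Python dicts, Spark infers sentiment as a map column, which is why the output renders it as key -> value pairs. A Parquet table written from Row objects, as in the question, would more likely hold an array of structs; the explode-and-extract steps below work the same either way. Purely as a sketch (this schema is my assumption, not read from your file), the same sample data could also be loaded with an explicit struct schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# Hypothetical explicit schema mirroring the Row(...) layout shown in the question
sentiment_item = StructType([
    StructField("score", StringType()),
    StructField("sentiment", StructType([StructField("score", StringType())])),
    StructField("text", StringType()),
])

struct_schema = StructType([
    StructField("name", StringType()),
    StructField("city", StringType()),
    StructField("sentiment", ArrayType(sentiment_item)),
])

# Reuses the `data` list defined above; each dict is mapped onto the struct fields
df_struct = spark.createDataFrame(data=data, schema=struct_schema)
df_struct.printSchema()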

Once you have the dataframe, you need to explode the sentiment column:

from pyspark.sql.functions import explode

df_exp = df.select(df["name"], df["city"], explode(df["sentiment"]))

Result:

df_exp.show(truncate=False)

+-----+----------+---------------------------------------------------------------------+
|name |city      |col                                                                  |
+-----+----------+---------------------------------------------------------------------+
|harry|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy]   |
|harry|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad]     |
|harry|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]        |
|sally|london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|sally|london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad]     |
|sally|london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|gary |london    |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited] |
|gary |london    |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> down]    |
|gary |london    |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]   |
|mary |manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad]     |
|mary |manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> low]     |
|mary |manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]    |
|gerry|manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic]|
|gerry|manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> good]    |
|gerry|manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]        |
+-----+----------+---------------------------------------------------------------------+

Finally, let's create a column containing just the text and select the three columns we want:

# Extract text
df_exp = df_exp.withColumn("text", df_exp["col"].text)

# Select result columns
result = df_exp.select("name", "city", "text")
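
Equivalently, the value can be pulled out with getItem (or with bracket notation, df_exp["col"]["text"]); the choice is mostly a matter of style:

from pyspark.sql.functions import col

# Same extraction as above, written with getItem on the exploded column
df_exp = df_exp.withColumn("text", col("col").getItem("text"))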

The result will be:

result.show(truncate=False)

+-----+----------+--------+
|name |city      |text    |
+-----+----------+--------+
|harry|london    |happy   |
|harry|london    |sad     |
|harry|london    |mad     |
|sally|london    |sad     |
|sally|london    |mad     |
|sally|london    |agitated|
|gary |london    |excited |
|gary |london    |down    |
|gary |london    |agitated|
|mary |manchester|sad     |
|mary |manchester|low     |
|mary |manchester|content |
|gerry|manchester|ecstatic|
|gerry|manchester|good    |
|gerry|manchester|bad     |
+-----+----------+--------+
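
To also apply the city = london filter from your question and bring the result back into pandas, a where clause can be added before collecting. A minimal sketch against the original parquet source (the path placeholder is kept from your post, and the column names are assumed to match the sample above):

from pyspark.sql.functions import col, explode

# Read the parquet table, keep only the london rows, explode the nested
# sentiment array, and extract the text of each element
result_london = (
    spark.read.parquet("INSERT S3 TABLE")
    .where(col("city") == "london")
    .select("name", "city", explode("sentiment").alias("s"))
    .withColumn("text", col("s")["text"])
    .select("name", "city", "text")
)

print(result_london.toPandas())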