I want to extract the "text" from the sentiment column of a table, filtered on city = london.
I have a table that looks like this:
name city sentiment
harry london "[
Row(score='0.999926',
sentiment=Row(score='-0.640237'),
text='happy'),
Row(score='0.609836',
sentiment=Row(score='-0.607594'),
text='sad'),
Row(score='0.58564',
sentiment=Row(score='-0.6833'),
text='mad')
]"
sally london "[
Row(score='0.999926',
sentiment=Row(score='-0.640237'),
text='sad'),
Row(score='0.609836',
sentiment=Row(score='-0.607594'),
text='mad'),
Row(score='0.58564',
sentiment=Row(score='-0.6833'),
text='agitated')
]"
gary london "[
Row(score='0.999926',
sentiment=Row(score='-0.640237'),
text='excited'),
Row(score='0.609836',
sentiment=Row(score='-0.607594'),
text='down'),
Row(score='0.58564',
sentiment=Row(score='-0.6833'),
text='agitated')
]"
mary manchester "[
Row(score='0.999926',
sentiment=Row(score='-0.640237'),
text='sad'),
Row(score='0.609836',
sentiment=Row(score='-0.607594'),
text='low'),
Row(score='0.58564',
sentiment=Row(score='-0.6833'),
text='content')
]"
gerry manchester "[
Row(score='0.999926',
sentiment=Row(score='-0.640237'),
text='ecstatic'),
Row(score='0.609836',
sentiment=Row(score='-0.607594'),
text='good'),
Row(score='0.58564',
sentiment=Row(score='-0.6833'),
text='bad')
]"
My code currently looks like this, but it doesn't work:
from pyspark.sql import functions as F
from pyspark.sql import types as T
data= spark.read.parquet("INSERT S3 TABLE").where("city LIKE 'london' AND sentiment['text=']")
df = sharethis.toPandas()
print (df)
And I want the output to look like this:
name city sentiment
harry london happy
harry london sad
harry london mad
sally london sad
sally london mad
sally london agitated
gary london excited
gary london down
gary london agitated
Does anyone know how I can access the array in the sentiment column to extract the text?
Thanks in advance.
First, create a dataframe with your sample data:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('explode_example').getOrCreate()
data = [
("harry", "london", [
{"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "happy"},
{"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "sad"},
{"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "mad"}
]),
("sally", "london", [
{"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
{"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "mad"},
{"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
]),
("gary", "london", [
{"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "excited"},
{"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "down"},
{"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "agitated"}
]),
("mary", "manchester", [
{"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "sad"},
{"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "low"},
{"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "content"}
]),
("gerry", "manchester", [
{"score": "0.999926", "sentiment": {"score": "-0.640237"}, "text": "ecstatic"},
{"score": "0.609836", "sentiment": {"score": "-0.607594"}, "text": "good"},
{"score": "0.58564", "sentiment": {"score": "-0.6833"}, "text": "bad"}
])
]
df = spark.createDataFrame(data=data, schema=["name", "city", "sentiment"])
What you have is the following dataframe:
df.show(truncate=False)
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|name |city |sentiment |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|harry|london |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad]] |
|sally|london |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]] |
|gary |london |[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> down], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated]]|
|mary |manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> low], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> content]] |
|gerry|manchester|[[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic], [sentiment -> {score=-0.607594}, score -> 0.609836, text -> good], [sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad]] |
+-----+----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Once you have the dataframe, you need to explode the sentiment column:
from pyspark.sql.functions import explode
df_exp = df.select(df["name"], df["city"], explode(df["sentiment"]))
Result:
df_exp.show(truncate=False)
+-----+----------+---------------------------------------------------------------------+
|name |city |col |
+-----+----------+---------------------------------------------------------------------+
|harry|london |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> happy] |
|harry|london |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> sad] |
|harry|london |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> mad] |
|sally|london |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad] |
|sally|london |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> mad] |
|sally|london |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated] |
|gary |london |[sentiment -> {score=-0.640237}, score -> 0.999926, text -> excited] |
|gary |london |[sentiment -> {score=-0.607594}, score -> 0.609836, text -> down] |
|gary |london |[sentiment -> {score=-0.6833}, score -> 0.58564, text -> agitated] |
|mary |manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> sad] |
|mary |manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> low] |
|mary |manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> content] |
|gerry|manchester|[sentiment -> {score=-0.640237}, score -> 0.999926, text -> ecstatic]|
|gerry|manchester|[sentiment -> {score=-0.607594}, score -> 0.609836, text -> good] |
|gerry|manchester|[sentiment -> {score=-0.6833}, score -> 0.58564, text -> bad] |
+-----+----------+---------------------------------------------------------------------+
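If it helps to see what `explode` is doing, here is the same flattening written in plain Python (no Spark needed; the tuples are a trimmed stand-in for the rows above):

```python
# Each input row carries a list; explode emits one output row per list element.
rows = [
    ("harry", "london", ["happy", "sad", "mad"]),
    ("mary", "manchester", ["sad", "low", "content"]),
]

exploded = [
    (name, city, item)
    for name, city, items in rows
    for item in items
]

for row in exploded:
    print(row)
```

Rows whose list is empty simply disappear; if you need to keep them, Spark's `explode_outer` emits a null element instead.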
Finally, create a column containing just the text and select the 3 desired columns:
# Extract text
df_exp = df_exp.withColumn("text", df_exp["col"].text)
# Select result columns
result = df_exp.select("name", "city", "text")
The result will be:
result.show(truncate=False)
+-----+----------+--------+
|name |city |text |
+-----+----------+--------+
|harry|london |happy |
|harry|london |sad |
|harry|london |mad |
|sally|london |sad |
|sally|london |mad |
|sally|london |agitated|
|gary |london |excited |
|gary |london |down |
|gary |london |agitated|
|mary |manchester|sad |
|mary |manchester|low |
|mary |manchester|content |
|gerry|manchester|ecstatic|
|gerry|manchester|good |
|gerry|manchester|bad |
+-----+----------+--------+
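One step from the original question is still missing: the filter on city = london. In Spark that would be something like `result.where(result["city"] == "london")` (or an equivalent `F.col("city") == "london"` predicate) before or after the explode. Conceptually, combined with the flattening above, it is just:

```python
rows = [
    ("harry", "london", ["happy", "sad", "mad"]),
    ("mary", "manchester", ["sad", "low", "content"]),
]

# Keep only london rows, then flatten: mirrors .where(...) plus explode(...)
london_only = [
    (name, city, text)
    for name, city, texts in rows
    if city == "london"
    for text in texts
]
```

Filtering before the explode is usually cheaper, since Spark then only expands the rows you actually keep.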