I installed Python 3.9 and PySpark on Windows 10 using Anaconda. I am following this tutorial on manipulating a DataFrame Column. I have pared it down to the following minimal working example, which consists of 4 parts: (1) the data for the DataFrame; (2) the schema for the DataFrame; (3) starting Spark and creating the DataFrame; (4) creating a new schema to rename the "name" subfields.
# Data for DataFrame
dataDF = [(('James', 'Smith'), '1991-04-01'),
          (('Michael', ''), '2000-05-19'),
         ]
# Schema definition needed to create DataFrame. The first "name"
# field consists of 2 subfields for first and last name
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField('name', StructType([
        StructField('firstname', StringType(), True),
        StructField('lastname', StringType(), True)
    ])),
    StructField('dob', StringType(), True),
])
# Start a Spark session and create the DataFrame "df"
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
df = spark.createDataFrame(data = dataDF, schema = schema)
# Define a new schema for renaming the "name" subfields
schema2 = StructType([
    StructField("fname", StringType()),
    StructField("lname", StringType())])
from pyspark.sql.functions import col
df.select( col("name").cast(schema2), col("dob") ).printSchema()
I tried to look up what cast does. It does not appear to be a Python cast, since it is a method of col("name"), which belongs to the Column class.

I unsuccessfully tried to extract the docstring by appending Spyder's ? suffix to cast, accessed through the Column object:

df.name.cast?

This produced the slew of errors listed in the attachment below.
Since (for example) df.select? displays the expected docstring, why does it not work for df.name.cast? More specifically, when can ? be expected to work?
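As an aside, the docstring can apparently be read without IPython's ? machinery: help() and the __doc__ attribute inspect the bound method directly and never truth-test the column. A minimal sketch, using a hypothetical stand-in class that mimics pyspark's Column (so it runs without Spark):

```python
# Hypothetical stand-in mimicking pyspark's Column: truth-testing the
# instance raises, but its bound methods still carry readable docstrings.
class FakeColumn:
    def cast(self, dataType):
        """Casts the column into type ``dataType``."""

    def __bool__(self):
        # pyspark's Column raises a ValueError much like this one
        raise ValueError("Cannot convert column into bool")

c = FakeColumn()

# Reading the docstring directly never evaluates `if c:`, so it works:
print(c.cast.__doc__)
help(c.cast)  # also works; it inspects the method, not the column
```

With the real DataFrame, the equivalent would be print(df.name.cast.__doc__) or help(df.name.cast).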
I note that df.name and df.name.cast are recognized objects:
In [98]: df.name
Out[98]: Column<'name'>
In [99]: df.name.cast
Out[99]: <bound method Column.cast of Column<'name'>>
df.name.cast?
Traceback (most recent call last):
Cell In[97], line 1
get_ipython().run_line_magic('pinfo', 'df.name.cast')
File ~\anaconda3\envs\py39\lib\site-packages\IPython\core\interactiveshell.py:2414 in run_line_magic
result = fn(*args, **kwargs)
File ~\anaconda3\envs\py39\lib\site-packages\IPython\core\magics\namespace.py:58 in pinfo
self.shell._inspect('pinfo', oname, detail_level=detail_level,
File ~\anaconda3\envs\py39\lib\site-packages\IPython\core\interactiveshell.py:1795 in _inspect
pmethod(
File ~\anaconda3\envs\py39\lib\site-packages\IPython\core\oinspect.py:782 in pinfo
info_b: Bundle = self._get_info(
File ~\anaconda3\envs\py39\lib\site-packages\IPython\core\oinspect.py:738 in _get_info
info_dict = self.info(obj, oname=oname, info=info, detail_level=detail_level)
File ~\anaconda3\envs\py39\lib\site-packages\IPython\core\oinspect.py:838 in info
if info and info.parent and hasattr(info.parent, HOOK_NAME):
File ~\anaconda3\envs\py39\lib\site-packages\pyspark\sql\column.py:1369 in __nonzero__
raise ValueError(
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Afterword: cast? on the column object does work for me.
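The last frames of the traceback hint at the root cause: IPython's pinfo performs a truth test along the lines of `if info and info.parent:`, and for df.name.cast the parent is the Column instance, whose __nonzero__ deliberately raises. Accessing cast through the class instead makes the parent a class, which is always truthy. A hedged sketch with a stand-in class (the truth test is my reading of the traceback, not IPython's exact code):

```python
class FakeColumn:
    def cast(self, dataType):
        """Casts the column into type ``dataType``."""

    def __bool__(self):  # mirrors pyspark's Column.__nonzero__
        raise ValueError("Cannot convert column into bool")

instance = FakeColumn()

# IPython's pinfo roughly does `if info and info.parent:`. For df.name.cast
# the parent is a Column *instance*, so the truth test raises:
try:
    if instance:
        pass
except ValueError as e:
    print("truth test on the instance raised:", e)

# For Column.cast the parent is the *class*; classes are always truthy,
# so the same test passes and the docstring is reachable:
if FakeColumn:
    print(FakeColumn.cast.__doc__)
```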