pyspark.pandas unique function not working

Problem description

I have a short piece of code in pyspark that I'm trying to run after converting to pyspark.pandas, but for some reason I keep getting an error:

spark = create_spark_session()
spark.sparkContext.setLogLevel("ERROR")
mysql_connection = create_db_connection(spark, 'mysql', minor_database='data')
mysql_query = """SELECT Id AS FixtureId, LeagueId, StartDate FROM fixtures WHERE LeagueId = 67 AND StartDate >= '2024-01-01' AND StartDate <= '2024-03-01'"""
df_mysql = mysql_connection.option('query', mysql_query).load()
df_pandas = df_mysql.to_pandas_on_spark()
tuple_fix_ids = df_pandas['FixtureId'].unique().to_numpy()

The error I get is:

Traceback (most recent call last):
  File "/opt/project/glue/main.py", line 43, in main
    tuple_fix_ids = df_pandas['FixtureId'].unique().to_numpy()
  File "/home/glue_user/.local/lib/python3.10/site-packages/pyspark/pandas/generic.py", line 585, in to_numpy
    return cast(np.ndarray, self._to_pandas().values)
  File "/home/glue_user/.local/lib/python3.10/site-packages/pyspark/pandas/series.py", line 1724, in _to_pandas
    return self._to_internal_pandas().copy()
  File "/home/glue_user/.local/lib/python3.10/site-packages/pyspark/pandas/series.py", line 7333, in _to_internal_pandas
    return self._psdf._internal.to_pandas_frame[self.name]
  File "/home/glue_user/.local/lib/python3.10/site-packages/pyspark/pandas/utils.py", line 600, in wrapped_lazy_property
    setattr(self, attr_name, fn(self))
  File "/home/glue_user/.local/lib/python3.10/site-packages/pyspark/pandas/internal.py", line 1115, in to_pandas_frame
    pdf = sdf.toPandas()
  File "/home/glue_user/.local/lib/python3.10/site-packages/pyspark/sql/pandas/conversion.py", line 212, in toPandas
    struct_in_pandas = jconf.pandasStructHandlingMode()
  File "/home/glue_user/.local/lib/python3.10/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/home/glue_user/.local/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco
    return f(*a, **kw)
  File "/home/glue_user/.local/lib/python3.10/site-packages/py4j/protocol.py", line 330, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling o175.pandasStructHandlingMode. Trace:
py4j.Py4JException: Method pandasStructHandlingMode([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)

What am I doing wrong?

I tried different functions in pyspark.pandas - they all failed.

python pyspark pyspark-pandas
1 Answer

It looks like you are trying to get the unique values from a column, but the error occurs while the output is being materialized. After converting the DataFrame with to_pandas_on_spark(), calling .unique() directly on a column may not work as expected in pyspark.pandas. You could instead call .drop_duplicates() first, select the 'FixtureId' column, and then convert it to a NumPy array as needed; a sketch of this is shown below. Make sure the column names and methods are being applied correctly in the pyspark.pandas context.
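
A minimal sketch of that approach, assuming the same df_pandas pandas-on-Spark DataFrame from the question (drop_duplicates and to_numpy are standard pyspark.pandas API; the column name is taken from the question):

# df_pandas is the pandas-on-Spark DataFrame produced by to_pandas_on_spark()
unique_ids = (
    df_pandas[['FixtureId']]   # keep only the column we need
    .drop_duplicates()         # distinct FixtureId values, still distributed
    .to_numpy()                # collect to the driver as a NumPy array of shape (n, 1)
    .ravel()                   # flatten to shape (n,)
)
tuple_fix_ids = tuple(unique_ids)  # e.g. for use in a SQL IN (...) clause

Selecting the column before drop_duplicates() keeps the deduplication limited to FixtureId rather than whole rows; only the final to_numpy() pulls data to the driver.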
