I'm working on a project where I need to use Snowpark in Anaconda together with the fuzzywuzzy package from the Snowflake channel: https://repo.anaconda.com/pkgs/snowflake/.
For context, I load a dummy table from Snowflake with the following code:
df = snowflake_session.sql(query_sql)
The table looks like this:
----------------------
| "col1" | "col2" |
----------------------
| 1111111 | 1111112 |
| 2222222 | 2222222 |
| 3333333 | 1333243 |
----------------------
Note: df is the Snowpark DataFrame returned by snowflake_session.sql(...).
The first thing I did was install the package in my Anaconda virtual environment:
!conda install --name snowflake_env -c https://repo.anaconda.com/pkgs/snowflake fuzzywuzzy
Next, I import the fuzzywuzzy library and use it in a function that I define as follows:
import fuzzywuzzy as fuzz
@udf(name="fuzzy", is_permanent=False, replace=True, packages=['fuzzywuzzy'])
def fuzzy(x: int, y: int) -> int:
    return fuzz.ratio(x, y)
I then apply the function to the columns of my DataFrame:
df.select("col1", "col2", fuzzy("col2", "col2")).show()
However, when I run this code, I get the following error:
SnowparkSQLException: (1304): 01b05970-0303-0e9b-0000-77590c4bc49a: 100357 (P0000): Python Interpreter Error:
Traceback (most recent call last):
File "_udf_code.py", line 37, in compute
File "_udf_code.py", line 26, in wrapper
File "C:\Users\es_oriol\AppData\Local\Temp\ipykernel_1234\660708773.py", line 5, in fuzzy
NameError: name 'fuzzywuzzy' is not defined
in function FUZZY with handler compute
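This makes me think there is a problem with how fuzzywuzzy is imported for my fuzzy function. Is that right? Does anyone have an idea how I can fix this?
Thanks in advance for any help!
Oriol
So far I have tried reinstalling the package, registering it with Snowflake, switching Anaconda virtual environments, and so on.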
The following steps worked for me.
I created a virtual environment with conda, like this:
conda create --name snowflake_env --override-channels -c https://repo.anaconda.com/pkgs/snowflake python=3.9 snowflake-snowpark-python fuzzywuzzy
The installed packages are:
$ pip list
Package Version
-------------------------- ------------
asn1crypto 1.5.1
Brotli 1.0.9
certifi 2023.7.22
cffi 1.15.1
charset-normalizer 2.0.4
cloudpickle 2.0.0
cryptography 41.0.3
filelock 3.9.0
fuzzywuzzy 0.18.0
idna 3.4
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
numpy 1.26.0
oscrypto 1.3.0
packaging 23.1
pip 23.3
platformdirs 3.8.1
pyarrow 10.0.1
pycparser 2.21
pycryptodomex 3.15.0
PyJWT 2.4.0
pyOpenSSL 23.2.0
PySocks 1.7.1
python-Levenshtein 0.12.2
pytz 2023.3.post1
PyYAML 6.0.1
requests 2.31.0
setuptools 68.0.0
snowflake-connector-python 3.2.0
snowflake-snowpark-python 1.9.0
sortedcontainers 2.4.0
tomlkit 0.11.1
typing_extensions 4.7.1
urllib3 1.26.18
wheel 0.41.2
I wrote a small Python script that uses the same UDF as yours:
$ cat stackoverflow.py
from snowflake.snowpark import Session
from snowflake.snowpark import Table
from snowflake.snowpark.functions import col
from snowflake.snowpark.functions import udf
from fuzzywuzzy import fuzz
connection_parameters = {
    "account": "XXXX",
    "user": "XXXX",
    "password": "XXXX",
    "role": "XXXX",
    "warehouse": "XXXX",
    "database": "XXXX",
    "schema": "public"
}
session = Session.builder.configs(connection_parameters).create()
df = session.sql("SELECT * FROM test_fuzz")
#df = session.table('test_fuzz')
@udf(name="fuzzy", is_permanent=False, replace=True, packages=['fuzzywuzzy'])
def fuzzy(x: int, y: int) -> int:
    return fuzz.ratio(x, y)
df.select("col1", "col2", fuzzy("col2", "col2")).show()
session.close()
I ran it:
$ python stackoverflow.py
---------------------------------------------------
|"COL1" |"COL2" |"FUZZY(""COL2"", ""COL2"")" |
---------------------------------------------------
|1111111 |1111112 |100 |
|2222222 |2222222 |100 |
|3333333 |1333243 |100 |
---------------------------------------------------
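One thing to keep in mind when reading this output: the script passes col2 for both arguments, so every row compares a value with itself, which is why all three scores are 100. To score col1 against col2, the integers would probably need to be converted to strings first, since fuzz.ratio works on strings. Below is a minimal local sketch of that idea, not part of the script above. The helper name similarity and the difflib fallback (used only when fuzzywuzzy cannot be imported) are my own assumptions for illustration.

```python
from difflib import SequenceMatcher

try:
    # Same import style as the script above: the fuzz submodule, not the bare package.
    from fuzzywuzzy import fuzz
    _ratio = fuzz.ratio
except ImportError:
    # Assumption for illustration: approximate the score with the standard library
    # so the sketch still runs where fuzzywuzzy is not installed.
    def _ratio(a: str, b: str) -> int:
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def similarity(x: int, y: int) -> int:
    """Hypothetical helper: compare two integers by their decimal string form."""
    return _ratio(str(x), str(y))


if __name__ == "__main__":
    # Rows copied from the example table in the question.
    rows = [(1111111, 1111112), (2222222, 2222222), (3333333, 1333243)]
    for col1, col2 in rows:
        print(col1, col2, similarity(col1, col2))
```

If this behaves as expected locally, the same str(...) conversion could go inside the UDF body, with fuzzy("col1", "col2") in the select call.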
Now, when I check the query history in the Snowflake UI, I can see that a temporary function was created, and it contains this part:
# The following comment contains the source code generated by snowpark-python for explanatory purposes.
# import fuzzywuzzy.fuzz as fuzz
# @udf(name="fuzzy", is_permanent=False, replace=True, packages=['fuzzywuzzy'])
# def fuzzy(x: int, y:int) -> int:
#     return fuzz.ratio(x, y)
#
# func = fuzzy
In your case, that code is missing the import line:
# import fuzzywuzzy.fuzz as fuzz
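If you want to check this on your side, one option is to copy the text of the generated function from the query history and look for the import line in the source comment. Here is a minimal sketch, assuming the comment format shown above. The helper name has_fuzzywuzzy_import is hypothetical and the check is only a plain text match.

```python
import re

# Matches a commented-out "import fuzzywuzzy..." or "from fuzzywuzzy ... import ..." line.
_IMPORT_RE = re.compile(
    r"^#\s*(?:import\s+fuzzywuzzy\b|from\s+fuzzywuzzy\b.*\bimport\b)"
)


def has_fuzzywuzzy_import(function_text: str) -> bool:
    """Return True if any comment line in the pasted text imports fuzzywuzzy."""
    return any(_IMPORT_RE.match(line.strip()) for line in function_text.splitlines())


if __name__ == "__main__":
    # Text pasted from the query history; this sample mirrors the comment shown above.
    pasted = """\
# import fuzzywuzzy.fuzz as fuzz
# @udf(name="fuzzy", is_permanent=False, replace=True, packages=['fuzzywuzzy'])
# def fuzzy(x: int, y:int) -> int:
#     return fuzz.ratio(x, y)
"""
    print(has_fuzzywuzzy_import(pasted))
```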
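The only explanation I can think of is something specific to your local environment. As a side note, I tested exactly the same steps on Ubuntu 22.04 and Windows 10, and they worked fine on both.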
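To narrow down an environment problem, it may help to confirm which interpreter the notebook kernel is actually running and which package versions it sees. Your traceback comes from ipykernel, and the install command targeted snowflake_env by name, so the two may not be the same environment. This is a minimal sketch using only the standard library; the function name installed_versions and the list of packages to check are my own choices.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # The interpreter path shows which conda environment this process belongs to.
    print("interpreter:", sys.executable)
    packages = ["fuzzywuzzy", "snowflake-snowpark-python", "cloudpickle"]
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this in a notebook cell and comparing the result with the pip list above should show whether the kernel is using the environment where fuzzywuzzy was installed.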