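I'm building a logistic regression model on Snowflake using Python. I've already fit the same logistic regression locally in R, but I want to move it to my Snowflake data warehouse. I've had some success, but I'm less familiar with Python than I am with R.
I believe the regression is being fit and produces a model. I don't really know what the predicted probabilities look like, but that's a secondary concern for now.
I just want to return a Snowflake DataFrame from a pandas DataFrame, and I can't get it to work.
Below is a snippet of my code.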
import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd

def main(session: snowpark.Session):
    #
    # EVERYTHING BEFORE WHAT'S BELOW IS DATA TRANSFORMATION, ALL OF IT WORKS JUST FINE
    # AS FAR AS I KNOW
    # ind_cols and dep_cols are arrays of column names
    # defining which columns are independent variables and which are dependent.
    # Here I split the sample into independent and dependent columns,
    # and use LogisticRegression from scikit-learn.
    X = full_sample[ind_cols].to_pandas()
    y = full_sample[dep_col].to_pandas()
    # ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    ret_df_lm = ret_df[ind_cols].to_pandas()
    lm = LogisticRegression()
    lm.fit(X, y)
    y_pred = lm.predict_proba(ret_df_lm)
    y_final = session.table(y_pred)
    #retention_pred = lm.predict(ret_df)
    return y_final
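When I try to return y_final, I get the error TypeError: sequence item 0: expected str instance, numpy.ndarray found. I must be missing something. I've tried other things, such as Snowflake's session.write_pandas(), but I'm not sure that's what I need.
How do I make y_final a Snowflake DataFrame?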
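I fixed your code based on the following observations:
session.table(y_pred) expects a string (the name of a table) as input, not a DataFrame; here y_pred is a NumPy array.
To return a Snowpark DataFrame, wrap the predictions in a pandas DataFrame and use return session.create_dataframe(y_final).
The example below uses randomly generated data in place of your tables.
# The Snowpark package is required for Python Worksheets.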
# You can add more packages by selecting them using the Packages control and then importing them.
import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd
import numpy as np

def main(session: snowpark.Session):
    #X = full_sample[ind_cols].to_pandas()
    #y = full_sample[dep_col].to_pandas()
    # Number of samples and features
    n_samples = 100  # for example, 100 samples
    n_features = 5  # for example, 5 features
    # Generate random data for X
    np.random.seed(0)  # for reproducibility
    X_data = np.random.rand(n_samples, n_features)
    X = pd.DataFrame(X_data, columns=[f'feature_{i}' for i in range(n_features)])
    # Generate random binary data for y
    y_data = np.random.randint(2, size=n_samples)
    y = pd.DataFrame(y_data, columns=['target'])
    # ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    # ret_df_lm = ret_df[ind_cols].to_pandas()
    ret_df_data = np.random.rand(n_samples, n_features)
    ret_df = pd.DataFrame(ret_df_data, columns=[f'feature_{i}' for i in range(n_features)])
    lm = LogisticRegression()
    # Pass the target as a 1-D Series; a one-column DataFrame triggers a shape warning
    lm.fit(X, y['target'])
    # predict_proba returns a NumPy array with one column per class
    y_pred = lm.predict_proba(ret_df)
    # y_final = session.table(y_pred)
    #retention_pred = lm.predict(ret_df)
    y_final = pd.DataFrame(y_pred, columns=['Prob_0', 'Prob_1'])
    # return a Snowpark DataFrame instead of a Pandas one
    return session.create_dataframe(y_final)
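
On the side question of what the predicted probabilities look like: predict_proba returns a NumPy array with one row per input row and one column per class, and the columns follow the order of lm.classes_. Here is a minimal local sketch that needs no Snowflake session. The random data and the PROB_ column prefix are illustrative assumptions rather than anything from your real tables. It names the columns from classes_ instead of hard-coding them:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative random data standing in for the real sample (assumption).
rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(5)]
X = pd.DataFrame(rng.random((100, 5)), columns=feature_names)
y = pd.Series(rng.integers(0, 2, size=100), name="target")
X_new = pd.DataFrame(rng.random((10, 5)), columns=feature_names)

lm = LogisticRegression()
lm.fit(X, y)

# One row per row of X_new, one column per class, ordered as in lm.classes_.
proba = lm.predict_proba(X_new)

# Name the columns after the class labels instead of hard-coding them.
proba_columns = [f"PROB_{label}" for label in lm.classes_]
y_final = pd.DataFrame(proba, columns=proba_columns)

print(y_final.shape)
print(y_final.head())
```

Each row sums to 1, so for a binary target the second column is the probability of the positive class. Inside the worksheet, passing this pandas DataFrame to session.create_dataframe would give the Snowpark DataFrame, as in the code above.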