Why am I getting a TypeError when running logistic regression with Python on Snowflake?


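I am building a logistic regression model on Snowflake using Python. I ran the same logistic regression locally in R, but I want to move it into my Snowflake data warehouse. I have had some success, but I am less familiar with Python than I am with R.

I believe the regression is fitted and produces a model. I don't really know what the predicted probabilities look like, but that is a secondary concern for now.

I just want to return a Snowflake DataFrame built from a pandas DataFrame. I can't get it to work.

Below is a snippet of my code.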

import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd

def main(session: snowpark.Session): 
#
# EVERYTHING BEFORE WHAT'S BELOW IS DATA TRANSFORMATION, ALL OF IT WORKS JUST FINE
# AS FAR AS I KNOW

# ind_cols and dep_cols are arrays of column names 
# defining which columns are independent variables and which are dependent.
# Here I split the sample into independent and dependent columns, 
# and use LogisticRegression from scikit-learn.

    X = full_sample[ind_cols].to_pandas()
    y = full_sample[dep_col].to_pandas()

# ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    ret_df_lm = ret_df[ind_cols].to_pandas()

    lm = LogisticRegression()

    lm.fit(X, y)

    y_pred = lm.predict_proba(ret_df_lm)

    y_final = session.table(y_pred)

    #retention_pred = lm.predict(ret_df)

    return y_final

When I try to return y_final, I get the error TypeError: sequence item 0: expected str instance, numpy.ndarray found. I must be missing something. I have tried other things, such as Snowflake's session.write_pandas(), but I'm not sure that is what I need.

How do I make y_final a Snowflake DataFrame?

python machine-learning scikit-learn snowflake-cloud-data-platform logistic-regression
1 Answer

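I fixed your code based on the following observations:

  • I had to generate random data, since your tables are not available to me.
  • The original error comes from
    session.table(y_pred)
    which expects a table name as a string, not the NumPy array returned by predict_proba.
  • To return a Snowpark DataFrame, you need to convert the pandas DataFrame:
    return session.create_dataframe(y_final)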
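
To see where a message like "sequence item 0: expected str instance, numpy.ndarray found" can come from, here is a minimal sketch that needs no Snowflake connection. It assumes that a non-string table name is treated as a sequence of name parts and joined with dots. The helper join_table_name is hypothetical and written only for illustration, it is not Snowpark's code, and a plain list of rows stands in for the NumPy array.

```python
def join_table_name(name):
    """Hypothetical stand-in: turn a table name or its parts into one string."""
    if isinstance(name, str):
        return name
    # str.join requires every item to be a str
    return ".".join(name)


# A sequence of strings is a valid multi-part name.
print(join_table_name(["MY_DB", "MY_SCHEMA", "MY_TABLE"]))

# A 2-D structure of probabilities is not: each item is a row, not a string.
rows = [[0.4, 0.6], [0.7, 0.3]]
try:
    join_table_name(rows)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

With a real NumPy array the reported item type would be numpy.ndarray rather than list, which matches the error in the question.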
# The Snowpark package is required for Python Worksheets. 
# You can add more packages by selecting them using the Packages control and then importing them.

import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd
import numpy as np


def main(session: snowpark.Session): 
    #X = full_sample[ind_cols].to_pandas()
    #y = full_sample[dep_col].to_pandas()

    # Number of samples and features
    n_samples = 100  # for example, 100 samples
    n_features = 5   # for example, 5 features
    
    # Generate random data for X
    np.random.seed(0)  # for reproducibility
    X_data = np.random.rand(n_samples, n_features)
    X = pd.DataFrame(X_data, columns=[f'feature_{i}' for i in range(n_features)])
    
    # Generate random binary data for y
    y_data = np.random.randint(2, size=n_samples)
    y = pd.DataFrame(y_data, columns=['target'])

    # ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    # ret_df_lm = ret_df[ind_cols].to_pandas()
    ret_df_data = np.random.rand(n_samples, n_features)
    ret_df = pd.DataFrame(ret_df_data, columns=[f'feature_{i}' for i in range(n_features)])

    lm = LogisticRegression()

    # Pass y as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
    lm.fit(X, y['target'])

    y_pred = lm.predict_proba(ret_df)

    # y_final = session.table(y_pred)

    #retention_pred = lm.predict(ret_df)
    y_final = pd.DataFrame(y_pred, columns=['Prob_0', 'Prob_1'])

    # return a Snowpark DataFrame instead of a Pandas one
    return session.create_dataframe(y_final)
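
As a possible variation on the column naming above, the probability columns can be derived from the fitted model's classes_ attribute instead of being hard-coded, so they stay correct if the target has different labels. This is only a sketch under stated assumptions: probabilities_to_frame is a hypothetical helper name, the data is random as in the answer, and the final Snowpark call is left as a comment because it needs a live session.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def probabilities_to_frame(model, features):
    """Hypothetical helper: one probability column per class, named from model.classes_."""
    proba = model.predict_proba(features)
    columns = [f"PROB_{label}" for label in model.classes_]
    return pd.DataFrame(proba, columns=columns, index=features.index)


# Random stand-in data, as in the answer above
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((100, 5)), columns=[f"feature_{i}" for i in range(5)])
y = pd.Series(rng.integers(0, 2, size=100), name="target")

model = LogisticRegression().fit(X, y)
pred_df = probabilities_to_frame(model, X.head(10))
print(pred_df.head())

# Inside a Snowpark Python worksheet, the pandas result would then be converted with:
#     return session.create_dataframe(pred_df)
```

Each row of pred_df sums to 1, since predict_proba returns one probability per class.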