How can we execute a join query over JDBC instead of fetching multiple tables with PySpark?


customer - c_id, c_name, c_address
product - p_id, p_name, price
supplier - s_id, s_name, s_address
orders - o_id, c_id, p_id, quantity, time

SELECT o.o_id,
       c.c_id,
       c.c_name,
       p.p_id,
       p.p_name,
       p.price * o.quantity AS amount
FROM customer c
JOIN orders o ON o.c_id = c.c_id
JOIN product p ON p.p_id = o.p_id;

I want to execute the above query directly, instead of fetching the three tables as separate DataFrames in PySpark and joining them there.
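For reference, this is roughly the multi-DataFrame approach I want to avoid (a minimal sketch; "url", "username", and "password" are placeholder connection values):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
props = {"user": "username", "password": "password"}

# one JDBC round-trip per table
customer = spark.read.jdbc("url", "customer", properties=props)
orders = spark.read.jdbc("url", "orders", properties=props)
product = spark.read.jdbc("url", "product", properties=props)

# joins happen in Spark, not in the database
df = (customer.join(orders, "c_id")
              .join(product, "p_id")
              .select("o_id", "c_id", "c_name", "p_id", "p_name",
                      (product["price"] * orders["quantity"]).alias("amount")))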

pyspark pyspark-sql
1 Answer

You can pass a query in place of a table name, as shown below.

Refer to the PySpark Documentation.

df = spark.read.jdbc(
        "url",                 # JDBC connection URL
        "(query) as tmp",      # any subquery can stand in for a table name
        properties={"user": "username", "password": "password"})

In your case, it would be:

df = spark.read.jdbc("url", """
    (
        SELECT o.o_id,
            c.c_id,
            c.c_name,
            p.p_id,
            p.p_name,
            p.price * o.quantity AS amount
            FROM customer c
            JOIN orders o ON o.c_id = c.c_id
            JOIN product p ON p.p_id = o.p_id
    ) as table""", properties={"user":"username", "password":"password"})
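Alternatively, on Spark 2.4 and later the JDBC source accepts a query option directly, so no subquery alias is needed at all (connection values are again placeholders):

df = (spark.read.format("jdbc")
      .option("url", "url")            # placeholder JDBC URL
      .option("user", "username")
      .option("password", "password")
      .option("query", """
          SELECT o.o_id, c.c_id, c.c_name, p.p_id, p.p_name,
                 p.price * o.quantity AS amount
          FROM customer c
          JOIN orders o ON o.c_id = c.c_id
          JOIN product p ON p.p_id = o.p_id""")
      .load())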

This answer uses this kind of query in place of a table, and this question is also relevant to your case.
