Calculating percentiles in Databricks

Question

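Can anyone tell me where the error is? What exactly am I doing wrong? (Databricks)

Even the example from the Databricks website does not work and produces the same error shown below. Is there another way to calculate this metric?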

select

        customerid,
        yearid,
        monthid,
        sum(TotalSpendings) as TotalSpendings,
        sum(TotalQuantity) as TotalQuantity,
        count (distinct ticketid) as TotalTickets,
        AVG(AvgIndexesPerTicket) as AvgIndexesPerTicket,
        max (transactiondate) as DateOfLastVisit,
        count(distinct transactiondate) as TotalNumberOfVisits,
        AVG(TotalSpendings) as AverageTicket,
        sum(TotalQuantity)/count(distinct ticketid) as AvgQttyPerTicket,
        sum(TotalDiscount) as TotalDiscount,
        percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_25,
        percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_50,
        percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_75,
        percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalQuantity) as PercentileQttyTicket_90,
        
        percentile_disc(0.25) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_25,
        percentile_disc(0.50) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_50,
        percentile_disc(0.75) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_75,
        percentile_disc(0.90) WITHIN GROUP (ORDER BY TotalSpendings) as PercentileSpendingsTicket_90
        
from (

select
        a.customerid,
        a.ticketid,
        a.transactiondate,
        extract(year from a.transactiondate) as yearid,
        extract(month from a.transactiondate) as monthid,
        sum(positionvalue) as TotalSpendings,
        sum(quantity) as TotalQuantity,   
        count(distinct productindex)/count(distinct a.ticketid) as AvgIndexesPerTicket,
        sum(discountvalue) as TotalDiscount
        from default.TICKET_ITEM a
          

        where 1=1

        and a.transactiondate between '2022-10-01' and '2022-10-31'
        and a.transactiontype = 'S'
        and a.transactiontypeheader = 'S'
        and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
        and length(a.customerid) > 10
        group by 1,2,3,4,5) DETAL
        
        group by 1,2,3"""

I still get the error:

ParseException: no viable alternative at input 'GROUP('(line 15, pos 43)

Answer 1

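Try reducing the complexity of the query until you find the problem. Since I do not have the TICKET_ITEM Hive table, I cannot debug the issue in my own environment. I often break a complex query into pieces.

First, always put your data in a schema (database) so that it is managed.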

%sql
create database STACK_OVER_FLOW

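Your table would then be recreated as STACK_OVER_FLOW.TICKET_ITEM.

Second, put the inner query into a permanent or temporary view. The code below creates a permanent view in the new schema.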

%sql
create view STACK_OVER_FLOW.FILTERED_TICKET_ITEM as
select
    a.customerid,
    a.ticketid,
    a.transactiondate,
    extract(year from a.transactiondate) as yearid,
    extract(month from a.transactiondate) as monthid,
    sum(a.positionvalue) as TotalSpendings,
    sum(a.quantity) as TotalQuantity,
    count(distinct a.productindex) / count(distinct a.ticketid) as AvgIndexesPerTicket,
    sum(a.discountvalue) as TotalDiscount
from 
    STACK_OVER_FLOW.TICKET_ITEM a
where 
    1=1
    and a.transactiondate between '2022-10-01' and '2022-10-31'
    and a.transactiontype = 'S'
    and a.transactiontypeheader = 'S'
    and a.customerid in ('94861b2c83c54d03930af4585a3a325a')
    and length(a.customerid) > 10
group by 
    customerid,
    ticketid,
    transactiondate,
    yearid,
    monthid

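Third, always group by or order by column name rather than by position, because the order of the fields may change over time. I did notice an extra """ at the end of the query, but that may be a typo.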

At this point you will know whether the inner query works correctly as a view, and you can focus on the outer query with the percentiles.
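The two steps above (isolate the inner query in a view, then group by column name) can be sketched end to end. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database instead of Databricks, the sample rows are invented, the table is reduced to a few columns, and SQLite's strftime stands in for extract(month from ...).

```python
import sqlite3

# In-memory SQLite database standing in for the Databricks schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ticket_item (
        customerid TEXT,
        ticketid TEXT,
        transactiondate TEXT,
        positionvalue REAL,
        quantity INTEGER
    );

    -- Invented sample rows, for illustration only.
    INSERT INTO ticket_item VALUES
        ('c1', 't1', '2022-10-03', 10.0, 1),
        ('c1', 't1', '2022-10-03',  5.0, 2),
        ('c1', 't2', '2022-10-17', 20.0, 4);

    -- Step 1: the inner query becomes a view that can be checked on its own.
    -- Grouping is by column name, not by position.
    CREATE VIEW filtered_ticket_item AS
    SELECT
        customerid,
        ticketid,
        transactiondate,
        CAST(strftime('%m', transactiondate) AS INTEGER) AS monthid,
        SUM(positionvalue) AS TotalSpendings,
        SUM(quantity) AS TotalQuantity
    FROM ticket_item
    GROUP BY customerid, ticketid, transactiondate, monthid;
""")

# Step 2: verify the inner query in isolation.
inner_rows = conn.execute(
    "SELECT ticketid, TotalSpendings, TotalQuantity "
    "FROM filtered_ticket_item ORDER BY ticketid"
).fetchall()

# Step 3: only then build the outer aggregation on top of the view.
outer_rows = conn.execute(
    "SELECT customerid, monthid, SUM(TotalSpendings), COUNT(DISTINCT ticketid) "
    "FROM filtered_ticket_item GROUP BY customerid, monthid"
).fetchall()

print(inner_rows)
print(outer_rows)
```

If the view returns the expected rows, any remaining error has to come from the outer query, which narrows the search to the percentile expressions.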

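In data engineering work, I have seen the Spark optimizer get confused when the number of temporary views is large. In those cases, an intermediate view may have to be written to a file as a separate step. You can then expose that file as a view and continue your engineering work.

percentile_disc is part of the Databricks distribution.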

https://docs.databricks.com/sql/language-manual/functions/percentile_disc.html

It is not a core function of the open-source distribution.

https://spark.apache.org/docs/latest/api/sql/index.html#percentile
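To make the difference between the two functions concrete, here is a minimal pure-Python sketch. It assumes the usual SQL definition of a discrete percentile (the first sorted value whose cumulative share reaches p) and assumes that the open-source percentile function interpolates linearly between neighbouring values. The helper names and the sample quantities are hypothetical, and the engines' own implementations may differ in edge cases.

```python
import math


def percentile_disc(values, p):
    """Discrete percentile: first sorted value whose cumulative share is >= p."""
    ordered = sorted(values)
    # ceil(p * n) is the 1-based rank; clamp so that p = 0 returns the minimum.
    # Note: floating-point rounding of p * n can shift the rank at exact boundaries.
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


def percentile_interpolated(values, p):
    """Interpolated percentile: linear interpolation between the two nearest values."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)  # fractional 0-based index
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


quantities = [1, 2, 3, 4]  # hypothetical TotalQuantity values per ticket

disc_median = percentile_disc(quantities, 0.5)
interp_median = percentile_interpolated(quantities, 0.5)

print(disc_median, interp_median)
```

The discrete version always returns a value that actually occurs in the data, while the interpolated version can return a value between two observations, so the two are not always interchangeable.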

If you still cannot find the problem after reducing the complexity, please add more information to the post.

Answer 2

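Why are you selecting customerid in the select statement? I believe you want to see, for each month, the total spendings, the total number of transactions, the 25th percentile of spendings, and so on? If so, you should not group by customerid. Instead, try the percentile function as below: percentile(TotalQuantity, 0.25) as PercentileQty_25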
