Getting statistics (q1, median, q3) from a counts dictionary


I have a dictionary of counts like this:

{1:2, 2:1, 3:1}

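I need to compute q1, the median, and q3 from it. For an odd number of values this is straightforward, but I can't figure out the even case. I want to do this without using any libraries (such as numpy).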

Example:

counts = {
            "4": 1,
            "1": 2,
            "5": 1
        }
results = {
            "q1": 1,
            "median": 2.5,
            "q3": 4,
        }

So far I have something like the following, but it doesn't handle all cases.

import math


def get_ratings_stats(counts):
    """This function will return min, q1, median, q3 and max value from list of ratings."""
    # NOTE: min and max are not computed yet, and the even-count median is not handled.

    # Cumulative frequency per rating, in sorted key order.
    cumulative_sum = 0
    cumulative_dict = {}
    for key, value in sorted(counts.items()):
        cumulative_sum += value
        cumulative_dict[key] = cumulative_sum

    q1_index = math.floor(cumulative_sum * 0.25)
    q3_index = math.ceil(cumulative_sum * 0.75)
    median_index = cumulative_sum * 0.5

    q1, q3, median = None, None, None
    print('indexes: ', q1_index, median_index, q3_index)
    # `cum` avoids shadowing the built-in `sum`; `is None` keeps a key of 0 from being overwritten.
    for key, cum in cumulative_dict.items():
        if q1 is None and cum >= q1_index:
            q1 = key
        if q3 is None and cum >= q3_index:
            q3 = key
        if median is None and cum >= median_index:
            median = key

    return q1, median, q3

Tags: python, dictionary, count, statistics, quantile

1 Answer

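The OP's code is almost complete; only the last part has a problem. Below are a few different implementations, together with their measured execution times.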

import math
import statistics as st # used for stats_with_stats & workbench


def stats_with_stats(data: dict):
    # Flatten the frequency table into a list of raw values
    f_table = []
    for v, freq in data.items():
        f_table.extend([v] * freq)
    # Defaults: n=4 (quartiles), method='exclusive'
    return st.quantiles(f_table)


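def stats_by_cards(data: dict):
    n = sum(data.values())  # total frequency

    # Positions of the quartiles in the (virtual) sorted list of values
    q1_i = math.floor(n * 0.25)
    q2_i = n * 0.5
    q3_i = math.ceil(n * 0.75)

    qs = iter((q1_i, q2_i, q3_i))

    out_stats = []
    q = next(qs)
    cum_f = 0
    for v, freq in sorted(data.items()):
        cum_f_new = cum_f + freq
        # `while` rather than `if`: several quartile positions can fall in the
        # same interval [cum_f, cum_f_new), e.g. when one value dominates.
        while q is not None and cum_f <= q < cum_f_new:
            out_stats.append(v)
            q = next(qs, None)
        if q is None:
            break
        cum_f = cum_f_new

    return out_stats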


def stats_by_learner(data: dict):
    # Map each value to its half-open cumulative-frequency interval [lower, upper)
    tmp_data = {}
    f_cum = 0
    for v, f in sorted(data.items()):
        f_cum_new = f_cum + f
        tmp_data[v] = (f_cum, f_cum_new)  # <- pairs
        f_cum = f_cum_new

    q1_i = math.floor(f_cum * 0.25)
    q2_i = f_cum * 0.5
    q3_i = math.ceil(f_cum * 0.75)

    qs = iter((q1_i, q2_i, q3_i))

    out_stats = []
    q = next(qs)
    for v, (lower_freq, upper_freq) in tmp_data.items():
        # `while` rather than `if`: several quartile positions can share one interval
        while q is not None and lower_freq <= q < upper_freq:
            out_stats.append(v)
            q = next(qs, None)
        if q is None:
            break

    return out_stats
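
The functions above always return one of the dictionary's own values for the median, so they do not produce the 2.5 that the question expects when the total count is even. One possible way to handle that case without expanding the dictionary is sketched below. The helper names `kth_smallest` and `median_from_counts` are made up for this illustration, and the sketch assumes numeric keys (string keys such as "4" would need converting with `int(...)` first) and the common convention that an even-sized sample takes the mean of its two middle values.

```python
def kth_smallest(counts, k):
    """Return the k-th smallest value (1-based) of the multiset described by counts."""
    cumulative = 0
    for value, freq in sorted(counts.items()):
        cumulative += freq
        if k <= cumulative:
            return value
    raise IndexError("k is larger than the total count")


def median_from_counts(counts):
    """Median of the multiset; averages the two middle values when the total is even."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("counts is empty")
    lower = kth_smallest(counts, (n + 1) // 2)  # lower middle element
    upper = kth_smallest(counts, n // 2 + 1)    # upper middle element (same one if n is odd)
    return (lower + upper) / 2


print(median_from_counts({4: 1, 1: 2, 5: 1}))
```

For the question's example (values 1, 1, 4, 5) the two middle elements are 1 and 4, which gives the expected median of 2.5. The same `kth_smallest` helper could be reused for q1 and q3 once a rank convention for them has been chosen.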

Timings were taken with the following dataset:

from collections import Counter
import random

# test with sample dataset
random.seed(123456) # for sake of "reproducibility"
dataset = Counter([random.randint(1, 100) for _ in range(100)])
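
The timing harness itself is not shown in this answer. A minimal sketch of how such measurements could be collected with `timeit` follows; the `bench` helper, the `repeat` and `number` values, and the stand-in function `total_count` are all assumptions made for illustration, not the setup that produced the figures reported here.

```python
import statistics as st
import timeit


def bench(func, data, repeat=5, number=1000):
    """Time func(data): `repeat` runs of `number` calls each (both values are assumed)."""
    times = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    # st.stdev is the sample standard deviation of the run times
    return {"times": times, "mean": st.mean(times), "std": st.stdev(times)}


# Stand-in workload so the block runs on its own; swap in stats_by_cards etc.
def total_count(data):
    return sum(data.values())


result = bench(total_count, {1: 2, 4: 1, 5: 1}, repeat=5, number=1000)
for name, value in result.items():
    print(f"{name:<8}{value}")
```

Each entry in `times` is the total time in seconds for `number` calls, so absolute values depend heavily on the chosen `number` and on the machine.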

Output

check outputs:
stats_by_learner    [23, 49, 75]
stats_by_cards      [23, 49, 75]
stats_with_stats    [23.0, 49.0, 75.0]

quartiles with "stats_by_learner"
times           [41.32080510599917, 36.5191725270015, 36.58397209500254, 36.66133224499936, 36.83490775700193]
mean            37.5840379460009
std             2.0922524181898896
quartiles with "stats_by_cards"
times           [27.217588879000687, 27.218666459000815, 29.070444919001602, 27.207161409998662, 31.960372033001477]
mean            28.53484674000065
std             2.076736386875994
quartiles with "stats_with_stats"
times           [81.97632466700088, 84.27796363499874, 90.61311744499835, 85.13804757300022, 82.74273506200188]
mean            84.94963767640002
std             3.401200405652809

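A remark on the definition of quartiles: quartiles computed this way (following the OP's approach) may not agree with other definitions, as the following check shows: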

check outputs (with 50 terms & seed=123456)
stats_by_learner    [13, 42, 66]
stats_by_cards      [13, 42, 66]
stats_with_stats    [12.75, 40.5, 65.25]
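
The gap between the outputs comes from how a quartile is defined when it falls between two data points: `statistics.quantiles` interpolates between neighbouring values, whereas the cumulative-frequency functions always return a value that is present in the data. As a small illustration, the sketch below runs `statistics.quantiles` with both of its documented methods on the expanded form of the question's example; the variable names are chosen here for illustration only.

```python
import statistics as st

values = [1, 1, 4, 5]  # expanded form of {1: 2, 4: 1, 5: 1}

exclusive = st.quantiles(values, n=4, method="exclusive")  # the default method
inclusive = st.quantiles(values, n=4, method="inclusive")

print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

Both methods agree on the median here but differ on q3, and neither gives the q3 of 4 listed in the question, so it is worth deciding which quartile convention is wanted before comparing implementations.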