如何在 Polars 中用矢量化解决方案替换“map_groups”?

问题描述 投票:0回答:1

我有一个函数

vol_buckets()
,它有一个内部函数
_vol_buckets_engine()
。我想找到一种更好的方法来执行相同的逻辑。

该函数使用的数据在数据框中有多个符号,我需要

low_vol
mid_vol
独立计算强制每个符号。我不想使用
map_groups()
函数,因为众所周知它很慢。
_vol_buckets_engine()
的目的是使用
low_vol
方法计算
mid_vol
列中每个符号的
realized_volatility
.quantile()
。然后创建一个 Polars 表达式
vol_bucket
,它创建一个列并用 high、med 或 low str 标记该行,以表示该行与同一交易品种的其他行相比的波动性桶。

我想使用表达式和 .over("symbol") 方法来实现此功能。您建议我如何实现每个符号的这一目标?

数据/表示/函数

数据

data = pl.read_csv(
b"""
symbol,date,realized_volatility
dummy,2023-01-04T00:00:00.000000000,1.050947170034001
dummy,2023-01-05T00:00:00.000000000,1.0930296529563293
dummy,2023-01-06T00:00:00.000000000,5.0831401853504135
dummy,2023-01-09T00:00:00.000000000,3.8529778077664285
dummy,2023-01-10T00:00:00.000000000,3.1224700809152663
dummy,2023-01-11T00:00:00.000000000,3.328913046103626
dummy,2023-01-12T00:00:00.000000000,2.8538745003208548
dummy,2023-01-13T00:00:00.000000000,2.6237053380435693
dummy,2023-01-17T00:00:00.000000000,2.416355871776205
dummy,2023-01-18T00:00:00.000000000,2.2033129305712387
dummy,2023-01-19T00:00:00.000000000,2.0031931513812293
dummy,2023-01-20T00:00:00.000000000,2.1377752983292035
dummy,2023-01-23T00:00:00.000000000,2.3889134269559436
dummy,2023-01-24T00:00:00.000000000,2.2898536571943926
dummy,2023-01-25T00:00:00.000000000,2.1522017746630864
dummy,2023-01-26T00:00:00.000000000,2.153211000843634
dummy,2023-01-27T00:00:00.000000000,2.1344909490471045
dummy,2023-01-30T00:00:00.000000000,2.2444631484198787
dummy,2023-01-31T00:00:00.000000000,2.1693660366593583
dummy,2023-02-01T00:00:00.000000000,2.091741023814303
dummy,2023-02-02T00:00:00.000000000,2.622032115043669
dummy,2023-02-03T00:00:00.000000000,2.767740741501858
dummy,2023-02-06T00:00:00.000000000,2.790631908078487
dummy,2023-02-07T00:00:00.000000000,2.825755005215503
dummy,2023-02-08T00:00:00.000000000,2.839487798967736
dummy,2023-02-09T00:00:00.000000000,2.748677337185615
dummy,2023-02-10T00:00:00.000000000,2.6491272220932562
dummy,2023-02-13T00:00:00.000000000,2.678068235650906
dummy,2023-02-14T00:00:00.000000000,2.591777646393414
dummy,2023-02-15T00:00:00.000000000,2.621357463850524
dummy,2023-02-16T00:00:00.000000000,2.6205239182151603
dummy,2023-02-17T00:00:00.000000000,2.1897201502231676
dummy,2023-02-21T00:00:00.000000000,2.4364850847077117
dummy,2023-02-22T00:00:00.000000000,2.4324879545270295
dummy,2023-02-23T00:00:00.000000000,2.2858654833211633
dummy,2023-02-24T00:00:00.000000000,2.399940087252254
dummy,2023-02-27T00:00:00.000000000,2.388152400576198
dummy,2023-02-28T00:00:00.000000000,2.3660255638521663
dummy,2023-03-01T00:00:00.000000000,2.427234380419688
dummy,2023-03-02T00:00:00.000000000,2.4330829447444504
dummy,2023-03-03T00:00:00.000000000,2.718003833450704
dummy,2023-03-06T00:00:00.000000000,2.6476401801968548
dummy,2023-03-07T00:00:00.000000000,2.6860748665098315
dummy,2023-03-08T00:00:00.000000000,2.7021875940028997
dummy,2023-03-09T00:00:00.000000000,2.7054264661805933
dummy,2023-03-10T00:00:00.000000000,2.709577864255094
dummy,2023-03-13T00:00:00.000000000,2.6276349342884733
dummy,2023-03-14T00:00:00.000000000,2.6670200150686627
dummy,2023-03-15T00:00:00.000000000,2.6481374152619575
dummy,2023-03-16T00:00:00.000000000,2.3098956144797222
dummy,2023-03-17T00:00:00.000000000,2.1191230404273558
dummy,2023-03-20T00:00:00.000000000,2.0864454171790796
dummy,2023-03-21T00:00:00.000000000,2.0097633983245315
dummy,2023-03-22T00:00:00.000000000,1.929233413402791
dummy,2023-03-23T00:00:00.000000000,1.9295479237168838
dummy,2023-03-24T00:00:00.000000000,1.9509333112340799
dummy,2023-03-27T00:00:00.000000000,1.8844254026075675
dummy,2023-03-28T00:00:00.000000000,1.8839124342657283
dummy,2023-03-29T00:00:00.000000000,1.951087131997698
dummy,2023-03-30T00:00:00.000000000,1.9461090376651762
dummy,2023-03-31T00:00:00.000000000,2.009117720680126
dummy,2023-04-03T00:00:00.000000000,1.7772642996364656
dummy,2023-04-04T00:00:00.000000000,1.7780422055031615
dummy,2023-04-05T00:00:00.000000000,1.8187341926990974
dummy,2023-04-06T00:00:00.000000000,1.7150029341723647
dummy,2023-04-10T00:00:00.000000000,1.781541227195909
dummy,2023-04-11T00:00:00.000000000,1.7974543772914964
dummy,2023-04-12T00:00:00.000000000,1.7330193896599626
dummy,2023-04-13T00:00:00.000000000,2.1147528127542894
dummy,2023-04-14T00:00:00.000000000,1.706628052331882
dummy,2023-04-17T00:00:00.000000000,1.5907012619921879
dummy,2023-04-18T00:00:00.000000000,1.5371944764215826
dummy,2023-04-19T00:00:00.000000000,1.5298371806238649
dummy,2023-04-20T00:00:00.000000000,1.463565758594709
dummy,2023-04-21T00:00:00.000000000,1.42981031912061
dummy,2023-04-24T00:00:00.000000000,1.37111916315023
dummy,2023-04-25T00:00:00.000000000,1.3344651780937993
dummy,2023-04-26T00:00:00.000000000,1.3320778436462895
dummy,2023-04-27T00:00:00.000000000,1.4850652068750114
dummy,2023-04-28T00:00:00.000000000,1.494247444270745
dummy,2023-05-01T00:00:00.000000000,1.4134287701935544
dummy,2023-05-02T00:00:00.000000000,1.3780025740294264
dummy,2023-05-03T00:00:00.000000000,1.3639257736538914
dummy,2023-05-04T00:00:00.000000000,1.3812923208247097
dummy,2023-05-05T00:00:00.000000000,2.0832373775225053
dummy,2023-05-08T00:00:00.000000000,2.0305097315900373
dummy,2023-05-09T00:00:00.000000000,2.059561989940993
dummy,2023-05-10T00:00:00.000000000,1.963606664476099
dummy,2023-05-11T00:00:00.000000000,1.9306937107888855
dummy,2023-05-12T00:00:00.000000000,1.8582842726888125
dummy,2023-05-15T00:00:00.000000000,1.8406958433531144
dummy,2023-05-16T00:00:00.000000000,1.8370033720288037
dummy,2023-05-17T00:00:00.000000000,1.7970463374858805
dummy,2023-05-18T00:00:00.000000000,1.8500710447283415
dummy,2023-05-19T00:00:00.000000000,1.7605430717260266
dummy,2023-05-22T00:00:00.000000000,1.7511317900402594
dummy,2023-05-23T00:00:00.000000000,1.824326390923716
dummy,2023-05-24T00:00:00.000000000,1.4375350280115315
dummy,2023-05-25T00:00:00.000000000,1.4514367372766908
dummy,2023-05-26T00:00:00.000000000,1.519056219066445
dummy,2023-05-30T00:00:00.000000000,1.5382320775472667
dummy,2023-05-31T00:00:00.000000000,1.5217430737640685
dummy,2023-06-01T00:00:00.000000000,1.5972064596652586
dummy,2023-06-02T00:00:00.000000000,1.5714237778785558
dummy,2023-06-05T00:00:00.000000000,1.5902709727458448
dummy,2023-06-06T00:00:00.000000000,1.560745761259242
dummy,2023-06-07T00:00:00.000000000,1.5818161453753425
dummy,2023-06-08T00:00:00.000000000,1.3922006051979485
dummy,2023-06-09T00:00:00.000000000,1.3744562490572472
dummy,2023-06-12T00:00:00.000000000,1.457457768146953
dummy,2023-06-13T00:00:00.000000000,1.4466466918178154
dummy,2023-06-14T00:00:00.000000000,1.4361967395694815
dummy,2023-06-15T00:00:00.000000000,1.4447040609797785
dummy,2023-06-16T00:00:00.000000000,0.7312811461557279
dummy,2023-06-20T00:00:00.000000000,0.7313067277718319
dummy,2023-06-21T00:00:00.000000000,0.7077559307772997
dummy,2023-06-22T00:00:00.000000000,0.7628610294768818
dummy,2023-06-23T00:00:00.000000000,0.7634005996581734
dummy,2023-06-26T00:00:00.000000000,0.7731223634233964
dummy,2023-06-27T00:00:00.000000000,0.8472880682212705
dummy,2023-06-28T00:00:00.000000000,0.8611165955077903
dummy,2023-06-29T00:00:00.000000000,0.8576752702726677
dummy,2023-06-30T00:00:00.000000000,0.9746587880908129
dummy,2023-07-03T00:00:00.000000000,0.9957492823420077
dummy,2023-07-05T00:00:00.000000000,0.997091096401469
dummy,2023-07-06T00:00:00.000000000,0.9194450450911804
dummy,2023-07-07T00:00:00.000000000,0.930719343608125
dummy,2023-07-10T00:00:00.000000000,0.9562659005884814
dummy,2023-07-11T00:00:00.000000000,0.8913934198663264
dummy,2023-07-12T00:00:00.000000000,0.8802192596228992
dummy,2023-07-13T00:00:00.000000000,0.8858906512045663
dummy,2023-07-14T00:00:00.000000000,0.7989043724135594
dummy,2023-07-17T00:00:00.000000000,0.892709516255626
dummy,2023-07-18T00:00:00.000000000,0.8732598147105561
dummy,2023-07-19T00:00:00.000000000,0.8888680151111504
dummy,2023-07-20T00:00:00.000000000,0.9032313961664961
dummy,2023-07-21T00:00:00.000000000,0.8350230361133806
dummy,2023-07-24T00:00:00.000000000,0.8394175070677063
dummy,2023-07-25T00:00:00.000000000,0.7633667480526932
dummy,2023-07-26T00:00:00.000000000,0.7679755142359856
dummy,2023-07-27T00:00:00.000000000,0.77867043188138
dummy,2023-07-28T00:00:00.000000000,0.7980766465969299
dummy,2023-07-31T00:00:00.000000000,0.789544454683515
dummy,2023-08-01T00:00:00.000000000,0.7958438502081496
dummy,2023-08-02T00:00:00.000000000,0.8689450706760995
dummy,2023-08-03T00:00:00.000000000,0.7953045457007718
dummy,2023-08-04T00:00:00.000000000,1.6305627295436527
dummy,2023-08-07T00:00:00.000000000,1.7145261459249155
dummy,2023-08-08T00:00:00.000000000,1.6469841461269166
dummy,2023-08-09T00:00:00.000000000,1.6609267664512968
dummy,2023-08-10T00:00:00.000000000,1.6604023328434614
dummy,2023-08-11T00:00:00.000000000,1.4799646239001687
dummy,2023-08-14T00:00:00.000000000,1.4890337553207493
dummy,2023-08-15T00:00:00.000000000,1.5211673127036998
dummy,2023-08-16T00:00:00.000000000,1.5275524984305846
dummy,2023-08-17T00:00:00.000000000,1.5893564019504338
dummy,2023-08-18T00:00:00.000000000,1.5510882876065588
dummy,2023-08-21T00:00:00.000000000,1.5686214277927917
dummy,2023-08-22T00:00:00.000000000,1.562627674533575
dummy,2023-08-23T00:00:00.000000000,1.719484207173839
dummy,2023-08-24T00:00:00.000000000,1.9620113717020276
dummy,2023-08-25T00:00:00.000000000,1.9150329352516895
dummy,2023-08-28T00:00:00.000000000,1.9409548361807556
dummy,2023-08-29T00:00:00.000000000,2.084490846534584
dummy,2023-08-30T00:00:00.000000000,2.1733623171564806
dummy,2023-08-31T00:00:00.000000000,2.160783655361729
dummy,2023-09-01T00:00:00.000000000,2.1793908039242895
dummy,2023-09-05T00:00:00.000000000,2.1728974464825286
dummy,2023-09-06T00:00:00.000000000,2.623702781798936
dummy,2023-09-07T00:00:00.000000000,2.913094567383203
dummy,2023-09-08T00:00:00.000000000,2.8552143121957325
dummy,2023-09-11T00:00:00.000000000,2.8668430074980438
dummy,2023-09-12T00:00:00.000000000,2.963098823270275
dummy,2023-09-13T00:00:00.000000000,2.927779071384871
dummy,2023-09-14T00:00:00.000000000,2.935777535772875
dummy,2023-09-15T00:00:00.000000000,2.105568022944933
dummy,2023-09-18T00:00:00.000000000,2.098469650547303
dummy,2023-09-19T00:00:00.000000000,2.102079780396924
dummy,2023-09-20T00:00:00.000000000,2.2148747362445484
dummy,2023-09-21T00:00:00.000000000,2.2419794282631558
dummy,2023-09-22T00:00:00.000000000,2.2503723453558866
dummy,2023-09-25T00:00:00.000000000,2.2385277976944358
dummy,2023-09-26T00:00:00.000000000,2.3879303451915144
dummy,2023-09-27T00:00:00.000000000,2.4066890858542047
dummy,2023-09-28T00:00:00.000000000,2.3335569029453844
dummy,2023-09-29T00:00:00.000000000,2.3340229115643214
dummy,2023-10-02T00:00:00.000000000,2.3883181263038784
dummy,2023-10-03T00:00:00.000000000,2.387917427869889
dummy,2023-10-04T00:00:00.000000000,2.243742722599013
dummy,2023-10-05T00:00:00.000000000,2.018830331402512
dummy,2023-10-06T00:00:00.000000000,2.0378580009972933
dummy,2023-10-09T00:00:00.000000000,2.0358593883362905
dummy,2023-10-10T00:00:00.000000000,1.8792234707485946
dummy,2023-10-11T00:00:00.000000000,1.7760982896480833
dummy,2023-10-12T00:00:00.000000000,1.7844568491345048
dummy,2023-10-13T00:00:00.000000000,1.7965419947611168
dummy,2023-10-16T00:00:00.000000000,1.796169748957089
dummy,2023-10-17T00:00:00.000000000,1.3653644940881955
dummy,2023-10-18T00:00:00.000000000,1.0801524435487886
dummy,2023-10-19T00:00:00.000000000,1.0774790301569788
dummy,2023-10-20T00:00:00.000000000,1.1384378583294845
dummy,2023-10-23T00:00:00.000000000,1.0359672502955974
dummy,2023-10-24T00:00:00.000000000,0.9893772090316733
dummy,2023-10-25T00:00:00.000000000,1.026284941186689
dummy,2023-10-26T00:00:00.000000000,1.2343892632363178
dummy,2023-10-27T00:00:00.000000000,1.1590590748581153
dummy,2023-10-30T00:00:00.000000000,1.197746969681852
dummy,2023-10-31T00:00:00.000000000,1.059927036681978
dummy,2023-11-01T00:00:00.000000000,1.1510136339666301
dummy,2023-11-02T00:00:00.000000000,1.287186157131098
dummy,2023-11-03T00:00:00.000000000,1.2780869517628282
dummy,2023-11-06T00:00:00.000000000,1.157582764635238
dummy,2023-11-07T00:00:00.000000000,1.2010550560932904
dummy,2023-11-08T00:00:00.000000000,1.2121833942772782
dummy,2023-11-09T00:00:00.000000000,1.211370091724534
dummy,2023-11-10T00:00:00.000000000,1.3180877547834722
dummy,2023-11-13T00:00:00.000000000,1.3226912699516258
dummy,2023-11-14T00:00:00.000000000,1.3738703306181093
dummy,2023-11-15T00:00:00.000000000,1.3592276047578322
dummy,2023-11-16T00:00:00.000000000,1.3134962243578143
dummy,2023-11-17T00:00:00.000000000,1.2889516286387386
dummy,2023-11-20T00:00:00.000000000,1.3145299415093774
dummy,2023-11-21T00:00:00.000000000,1.299562910563539
dummy,2023-11-22T00:00:00.000000000,1.2949833792649645
dummy,2023-11-24T00:00:00.000000000,1.275358569306392
dummy,2023-11-27T00:00:00.000000000,1.2754859720628269
dummy,2023-11-28T00:00:00.000000000,1.252060080071354
dummy,2023-11-29T00:00:00.000000000,1.243135474551939
dummy,2023-11-30T00:00:00.000000000,1.2448304628906661
dummy,2023-12-01T00:00:00.000000000,1.1846066040077565
dummy,2023-12-04T00:00:00.000000000,1.2157022562986464
dummy,2023-12-05T00:00:00.000000000,1.3629221743116913
dummy,2023-12-06T00:00:00.000000000,1.310758369219744
dummy,2023-12-07T00:00:00.000000000,1.1318339210448705
dummy,2023-12-08T00:00:00.000000000,1.1291758347967926
dummy,2023-12-11T00:00:00.000000000,1.1358579911415723
dummy,2023-12-12T00:00:00.000000000,1.1546273760073806
dummy,2023-12-13T00:00:00.000000000,1.130638333145214
dummy,2023-12-14T00:00:00.000000000,0.9862317372106153
dummy,2023-12-15T00:00:00.000000000,0.9794541986256703
dummy,2023-12-18T00:00:00.000000000,0.9317936185771348
dummy,2023-12-19T00:00:00.000000000,0.8708750952091696
dummy,2023-12-20T00:00:00.000000000,0.8990540381695469
dummy,2023-12-21T00:00:00.000000000,0.8968653416191089
dummy,2023-12-22T00:00:00.000000000,0.7262661600265888
dummy,2023-12-26T00:00:00.000000000,0.7033713118234325
dummy,2023-12-27T00:00:00.000000000,0.6339041006593921
dummy,2023-12-28T00:00:00.000000000,0.632426477361776
dummy,2023-12-29T00:00:00.000000000,0.6146423229289452
"""
)

功能

def vol_buckets(
    data: pl.DataFrame | pl.LazyFrame,
    lo_quantile: float = 0.4,
    hi_quantile: float = 0.8,
    _column_name_volatility: str = "realized_volatility",
) -> pl.DataFrame:
  
    # Group by 'symbol' and apply 'calculate_vol_buckets' to each group
    if isinstance(data, pl.LazyFrame):
        data = data.collect()

    result = data.group_by("symbol").map_groups(
        lambda group_df: _vol_buckets_engine(
            group_df,
            lo_quantile,
            hi_quantile,
            _column_name_volatility,
        )
    )

    return result.lazy()

def _vol_buckets_engine(
    grouped_data: pl.DataFrame | pl.LazyFrame,
    lo_quantile: float,
    hi_quantile: float,
    _column_name_volatility: str,
) -> pl.LazyFrame:

    # Calculate low and high quantiles for the group
    low_vol = (
        grouped_data.lazy()
        .select(pl.col(_column_name_volatility).quantile(lo_quantile))
        .collect()
        .to_series()[0]
    )
    mid_vol = (
        grouped_data.lazy()
        .select(pl.col(_column_name_volatility).quantile(hi_quantile))
        .collect()
        .to_series()[0]
    )

    # Determine the volatility bucket for each row
    vol_bucket = (
        pl.when(pl.col(_column_name_volatility) <= low_vol)
        .then(pl.lit("low"))
        .when(pl.col(_column_name_volatility) <= mid_vol)
        .then(pl.lit("mid"))
        .otherwise(pl.lit("high"))
        .alias("vol_bucket")
    )

    return grouped_data.lazy().with_columns(vol_bucket).collect()

Reprex

result = vol_buckets(data=data, lo_quantile=0.3, hi_quantile=0.65)

尝试1

这对于制作DF然后加入似乎有点多余。我应该只添加一列

low/mid_vol
.over("symbol)
然后过滤吗?我将非常感谢任何想法!谢谢

import polars as pl

def vol_buckets2(
    data: pl.DataFrame | pl.LazyFrame,
    lo_quantile: float = 0.4,
    hi_quantile: float = 0.8,
    _column_name_volatility: str = "realized_volatility",
) -> pl.DataFrame | pl.LazyFrame:
    """
    Calculate volatility buckets for each symbol without using `map_groups`.
    This function segments the input data into three volatility categories based on the specified quantiles of the volatility column.
    """
    # Ensure data is a LazyFrame for efficient computation
    if isinstance(data, pl.DataFrame):
        data = data.lazy()

    # Calculate low and high quantiles for each symbol
    low_vol = data.select(
        [(pl.col(_column_name_volatility)
        .quantile(lo_quantile)
        .over("symbol")
        .alias("low_vol")), pl.col("symbol"), pl.col("date")
    ])
    high_vol = data.select(
        pl.col(_column_name_volatility)
        .quantile(hi_quantile)
        .over("symbol")
        .alias("high_vol"), pl.col("symbol"), pl.col("date")
    )

    # Join quantile calculations back to the original data
    data = data.join(low_vol, on=["symbol", "date"], how="left")
    data = data.join(high_vol, on=["symbol", "date"], how="left")

    # Determine the volatility bucket for each row using expressions
    vol_bucket = (
        pl.when(pl.col(_column_name_volatility) <= pl.col("low_vol"))
        .then(pl.lit("low"))
        .when(pl.col(_column_name_volatility) <= pl.col("high_vol"))
        .then(pl.lit("mid"))
        .otherwise(pl.lit("high"))
        .alias("vol_bucket")
    )

    # Add the volatility bucket column to the data
    data = data.with_columns(vol_bucket)

    # Drop the intermediate quantile columns
    data = data.drop(["low_vol", "high_vol"])

    return data.collect() if isinstance(data, pl.LazyFrame) else data

dataframe performance grouping vectorization python-polars
1个回答
1
投票

您可以使用 Polars 的表达式 API 和

pl.Expr.qcut
来实现相同的功能。

def vol_buckets_new(
    data: pl.DataFrame | pl.LazyFrame,
    lo_quantile: float = 0.4,
    hi_quantile: float = 0.8,
    _column_name_volatility: str = "realized_volatility",
) -> pl.DataFrame:

    if isinstance(data, pl.LazyFrame):
        data = data.collect()

    result = data.with_columns(
        pl.col(_column_name_volatility)
        .qcut([lo_quantile, hi_quantile], labels=["low", "mid", "high"])
        .over("symbol")
        .alias("vol_bucket")
    )

    return result.lazy()
© www.soinside.com 2019 - 2024. All rights reserved.