Calculating slopes over variable-size windows, where the window size depends on values in the rows (Pandas/Python)

Question

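Thanks in advance for any advice. I am trying to calculate slopes over windows in a dataset, but I need the window size to be variable and to depend on values contained in the dataset. To show what I mean, take my dataset (df), which contains genetic data and consists of three columns: "Scaffold", "centiMorgan", and "Position". The real dataset has hundreds of rows, and the centiMorgan and Position columns are continuous variables. A toy dataset with only a few rows: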

df
Scaffold    centiMorgan    Position
Scaffold01     0.0           10004
Scaffold01     0.1           10006 
Scaffold01     0.5           10008
Scaffold02     1.5           10450
Scaffold02     2.9           11100
Scaffold02     3.0           11102
Scaffold03     3.8           12600
Scaffold04     4.6           12610

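I would like to calculate slopes over sliding windows in this dataset, such that a window corresponds to all rows whose "Position" values lie within a certain numerical distance (a set threshold) of each other.

For example, in the toy dataset above, I need to be able to calculate the distance between the first entry in the "Position" column and the second entry in the "Position" column. If this value (the "distance" between positions) is greater than the threshold I set (e.g. 20), the slope of that window is calculated and returned in a list (which is then written to a text file).

With a set threshold of 20, I would try to use a for loop to calculate the position distance row by row until there is a break of >20, store the centiMorgan value and Position of the first row in the window, and then store the last centiMorgan and last Position in the window in the same list.

However, if the "distance" between Position values is less than the set threshold, the window must keep looping over the rows in the file until the "distance" value exceeds the set threshold. At that point, all rows passed within that threshold are treated as one window, and the slope is calculated and appended to the list of slopes being compiled.

So, in this toy example, given a threshold of 20, the first three rows would be one window, because the Position values (10004, 10006, and 10008) all lie fewer than 20 counts from each other. The slope would therefore be calculated from the first and last points in the window.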

Scaffold    centiMorgan    Position
Scaffold01     0.0          10004
Scaffold01     0.1          10006 
Scaffold01     0.5          10008

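To calculate the slope, we use the following formula:

m = (Y2 - Y1) / (X2 - X1)

where the centiMorgan column corresponds to Y and Position corresponds to X. The first row in the window provides X1 and Y1, and the last row in the window provides X2 and Y2.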
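As a quick check of the formula, here is a minimal sketch that applies it to the first window of the toy data. The helper name `slope_first_last` and its convention of returning None for a window without a usable X range are made up for illustration and are not part of the original post.

```python
def slope_first_last(xs, ys):
    """Slope between the first and last point of a window.

    xs are Position values (X), ys are centiMorgan values (Y).
    Returns None for a single-row window or a zero X range
    (a convention assumed for this sketch).
    """
    if len(xs) < 2:
        return None
    dx = xs[-1] - xs[0]
    if dx == 0:
        return None
    return (ys[-1] - ys[0]) / dx


# First window of the toy dataset: Positions 10004, 10006, 10008
first_window_slope = slope_first_last([10004, 10006, 10008], [0.0, 0.1, 0.5])
print(first_window_slope)  # (0.5 - 0.0) / (10008 - 10004) = 0.125
```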

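However, at the break between Scaffold01 position 10008 and Scaffold02 position 10450, the difference exceeds our threshold of 20, so we must start a new window and loop over the rows again until the next break of >20. The following windows would therefore be: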

Scaffold02     1.5           10450

Scaffold02     2.9           11100
Scaffold02     3.0           11102

Scaffold03     3.8           12600
Scaffold04     4.6           12610
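The window boundaries listed above can be reproduced with a short pandas sketch, assuming a new window starts whenever the gap to the previous row's Position exceeds the threshold. The column name `window` is introduced here for illustration only.

```python
import pandas as pd

# Toy dataset from the question
df = pd.DataFrame({
    "Scaffold": ["Scaffold01", "Scaffold01", "Scaffold01", "Scaffold02",
                 "Scaffold02", "Scaffold02", "Scaffold03", "Scaffold04"],
    "centiMorgan": [0.0, 0.1, 0.5, 1.5, 2.9, 3.0, 3.8, 4.6],
    "Position": [10004, 10006, 10008, 10450, 11100, 11102, 12600, 12610],
})

threshold = 20

# Gap to the previous row; the first row has no predecessor, so its gap is NaN
gap = df["Position"].diff()

# A gap above the threshold marks the start of a new window. The running
# count of such breaks labels each row (NaN > threshold evaluates to False).
df["window"] = (gap > threshold).cumsum()

print(df["window"].tolist())  # [0, 0, 0, 1, 2, 2, 3, 3]
```

Rows sharing a label belong to the same window, so the single-row window at Position 10450 gets its own label.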

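My first attempt at this analysis was with pandas, using .diff to calculate the distance between Position values and then using the values in the resulting column to subset the windows: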

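import pandas as pd

# Toy dataset from above (the real data would be read from a file)
df = pd.DataFrame({
    'Scaffold': ['Scaffold01', 'Scaffold01', 'Scaffold01', 'Scaffold02',
                 'Scaffold02', 'Scaffold02', 'Scaffold03', 'Scaffold04'],
    'centiMorgan': [0.0, 0.1, 0.5, 1.5, 2.9, 3.0, 3.8, 4.6],
    'Position': [10004, 10006, 10008, 10450, 11100, 11102, 12600, 12610],
})

threshold = 20

# Subset position column:
posCol = df['Position']

# create a new column 'Dist' in original df with results of .diff
# use .fillna(0) to fill in 0 where there is no previous row
df['Dist'] = posCol.diff().fillna(0)

# loop over df['Dist'] and close the current window whenever a gap exceeds the threshold
slopes = []
start = 0  # index of the first row of the current window
for i in range(1, len(df) + 1):
    if i == len(df) or df['Dist'].iloc[i] > threshold:
        window = df.iloc[start:i]  # all rows of the current window
        first = window.iloc[0]     # first line of the window
        last = window.iloc[-1]     # last line of the window
        if len(window) > 1:
            # slope of the window using the first and last points
            slopes.append((last['centiMorgan'] - first['centiMorgan'])
                          / (last['Position'] - first['Position']))
        else:
            slopes.append(None)    # a single-row window has no slope
        # Move on to the next window and repeat.
        start = i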

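I am still thinking this analysis through and have not run any code yet, but I wanted to check whether my approach makes sense or is optimal. Any suggestions or feedback are much appreciated, whether in pandas or plain Python. Thank you.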

Answer 1

IIUC, you can use itertools.groupby to do the task:

from functools import cmp_to_key
from itertools import groupby


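def key_fn(a, b, threshold=20):
    # a and b are (centiMorgan, Position) tuples; compare their positions
    a, b = a[1], b[1]
    # True (non-zero) means "not equal", so groupby starts a new group
    return (b - a) > threshold


def calculate_slope(group):
    group = list(group)
    if len(group) == 1:
        # a single point has no slope
        return [None]
    else:
        # first and last (centiMorgan, Position) pairs of the window
        (Y1, X1), *_, (Y2, X2) = group
        # repeat the slope once per row so it lines up with the DataFrame
        return [(Y2 - Y1) / (X2 - X1)] * len(group)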


df[["group", "slopes"]] = [
    (i, v)
    for i, (_, g) in enumerate(
        groupby(zip(df["centiMorgan"], df["Position"]), key=cmp_to_key(key_fn))
    )
    for v in calculate_slope(g)
]

print(df)

Prints:

     Scaffold  centiMorgan  Position  group  slopes
0  Scaffold01          0.0     10004    0.0   0.125
1  Scaffold01          0.1     10006    0.0   0.125
2  Scaffold01          0.5     10008    0.0   0.125
3  Scaffold02          1.5     10450    1.0     NaN
4  Scaffold02          2.9     11100    2.0   0.050
5  Scaffold02          3.0     11102    2.0   0.050
6  Scaffold03          3.8     12600    3.0   0.080
7  Scaffold04          4.6     12610    3.0   0.080
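For comparison, here is a pandas-only sketch that yields one slope per window. It is not the answerer's method: it assumes windows break on the gap between consecutive rows, which may group rows differently from the `groupby` key above on other data. The names `window`, `ends`, `x1`, `x2`, `y1`, and `y2` are chosen for illustration.

```python
import pandas as pd

# Toy dataset from the question
df = pd.DataFrame({
    "Scaffold": ["Scaffold01", "Scaffold01", "Scaffold01", "Scaffold02",
                 "Scaffold02", "Scaffold02", "Scaffold03", "Scaffold04"],
    "centiMorgan": [0.0, 0.1, 0.5, 1.5, 2.9, 3.0, 3.8, 4.6],
    "Position": [10004, 10006, 10008, 10450, 11100, 11102, 12600, 12610],
})

threshold = 20

# Label windows: a new one starts wherever the gap to the previous
# Position exceeds the threshold
window = (df["Position"].diff() > threshold).cumsum().rename("window")

# First and last point of every window
ends = df.groupby(window).agg(
    x1=("Position", "first"),
    x2=("Position", "last"),
    y1=("centiMorgan", "first"),
    y2=("centiMorgan", "last"),
)

# Slope from first to last point; a single-row window gives 0/0, i.e. NaN
ends["slope"] = (ends["y2"] - ends["y1"]) / (ends["x2"] - ends["x1"])

# One slope per window, in window order
slopes = ends["slope"].tolist()
print(slopes)

# Optionally copy each window's slope back onto its rows
df["slope"] = window.map(ends["slope"])
```

On the toy data this gives four slopes, matching the `slopes` column above up to floating-point rounding, with NaN for the single-row window.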
