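Creating a shifted version of an existing column and interpolating missing values in a DataFrame with a sorted numeric index that contains duplicates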

Question  Votes: 1  Answers: 2

Given a Pandas DataFrame df with a sorted numeric index (representing e.g. time or distance) that may contain duplicate values:

     a    b
  0  4.0  1.0
1.5  5.5  2.5
1.5  5.5  2.5
  2  6.0  3.0
4.5  8.5  5.5

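I would like to create a column c from the values of column a, shifted along the index and aligned to the original index. Shifted index values that do not coincide with the original index should still be taken into account when filling in the original index values that received no value, e.g. by linear interpolation.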

Example:

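With 0.5 as an example index shift, column c would be built from column a over the index values 0, 0.5, 1.5, 2, 2.5, 4.5 and 5, giving the following intermediate result, where values that were missing and had to be filled in are marked with (i):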

      c
  0  NaN(i)
0.5  4.0
1.5  4.75(i)
  2  5.5
2.5  6.0
4.5  7.25(i)
  5  8.5

The final result should be indexed using the original index of df:

     a    b    c
  0  4.0  1.0  NaN(i)
1.5  5.5  2.5  4.75(i)
1.5  5.5  2.5  4.75(i)
  2  6.0  3.0  5.5
4.5  8.5  5.5  7.25(i)

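There is an open question about how to obtain the value for duplicate index entries. In this example one of the values was picked, but taking the average might be a better approach.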
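If averaging is preferred over picking one value, a minimal sketch could collapse the duplicate index labels with groupby(level=0).mean() before applying the shift. The frame below is the example from the question, except that the second 1.5 row is changed to 6.5 (a hypothetical value, not from the question) so that the averaging has a visible effect; the name a_mean is also only illustrative.

```python
import pandas as pd

# Example frame from the question; the second 1.5 row is changed to 6.5
# (a hypothetical value) so that averaging the duplicates is visible.
df = pd.DataFrame(
    {"a": [4.0, 5.5, 6.5, 6.0, 8.5]},
    index=[0, 1.5, 1.5, 2, 4.5],
)

# Collapse duplicate index labels to their mean; the result has a unique,
# sorted index and can then be shifted and interpolated as usual.
a_mean = df["a"].groupby(level=0).mean()
print(a_mean)
```

Because the averaged series has a unique index, it can be reindexed or assigned back onto the original (duplicated) index without ambiguity.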

python pandas dataframe interpolation feature-selection
2 Answers
0 votes

I think this is what you are trying to achieve:

import pandas as pd

# define the shift value
index_shift = 0.5
# re-index column a by the shifted index; keep the first of any duplicate labels
shifted = pd.Series(df["a"].to_numpy(), index=df.index + index_shift)
shifted = shifted[~shifted.index.duplicated(keep="first")]
# create the new column: original index values with no shifted match get NaN
df["c"] = shifted.reindex(df.index).to_numpy()

You can, of course, fill the new column with a value other than NumPy's NaN.


0 votes

This is my current approach, which takes only one of the duplicate index values into account when constructing the new column.


import pandas as pd
import numpy as np


def create_shift(df, column, shift_value, method, name):
    """
    Create a new column based on an existing column with a given shift value. 
    The shifted column is indexed based on an existing index with the
    missing values interpolated using the given method.

    :param df:          DataFrame to create the shift in.
    :param column:      The column name.
    :param shift_value: The value to shift the existing column by.
    :param method:      The interpolation method.
    :param name:        The name used for the newly created column.
    """
    if column in df.columns:
        current_index = df.index
        # creating the shifted index with the 2 decimal point precision
        shift_index = [round(i + shift_value, 2) for i in current_index.values]
        shift_data = pd.Series(data=df[column].tolist(), index=shift_index)
        # removing possible duplicates
        shift_data = shift_data[~shift_data.index.duplicated(keep='first')]
        shift_index = shift_data.index
        missing_index = current_index.difference(shift_index)
        combined_index = pd.Index(np.append(shift_index, missing_index)).sort_values()
        combined_data = shift_data.reindex(combined_index)
        combined_data.interpolate(method=method, inplace=True)
        df[name] = combined_data
    else:
        print("[Warning] Cannot create shift {} for missing {} column...".format(name, column))


d1 = {'a': [4.0, 5.5, 5.5, 6.0, 8.5], 'b': [1.0, 2.5, 2.5, 3.0, 5.5]}
df1 = pd.DataFrame(data=d1, index=[0, 1.5, 1.5, 2, 4.5])
create_shift(df1, 'a', 0.5, 'linear', 'c')
print(df1)
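
One detail worth keeping in mind about the interpolation method: in pandas, method='linear' ignores the index and treats the rows as equally spaced, whereas method='index' weights by the numeric distance between index values. The sketch below rebuilds the intermediate series from the question by hand (the names s, by_position and by_index are illustrative only) and compares the two.

```python
import numpy as np
import pandas as pd

# Combined index and shifted values from the question's intermediate result.
s = pd.Series(
    [np.nan, 4.0, np.nan, 5.5, 6.0, np.nan, 8.5],
    index=[0, 0.5, 1.5, 2, 2.5, 4.5, 5],
)

# "linear" ignores the index and treats the rows as equally spaced.
by_position = s.interpolate(method="linear")
# "index" weights by the numeric distance between index values.
by_index = s.interpolate(method="index")

print(pd.DataFrame({"linear": by_position, "index": by_index}))
```

With 'linear', the values at 1.5 and 4.5 come out as 4.75 and 7.25, matching the expected result in the question; with 'index' they come out as 5.0 and 8.0. If the index represents time or distance, 'index' may be the more appropriate choice, and since create_shift passes its method argument straight to interpolate, it should be possible to pass 'index' there as well.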