How to keep column names when converting from pandas to numpy

Question

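According to this post, I should be able to access the names of the columns in an ndarray as a.dtype.names

However, if I convert a pandas DataFrame to an ndarray with df.as_matrix() or df.values, the dtype.names field is None. Additionally, if I try to assign the column names to the ndarray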

import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))
print(X)
# df.as_matrix() was removed in pandas 1.0; df.values returns the same plain ndarray
print(type(X.values))     # <class 'numpy.ndarray'>
print(type(X.values[0]))  # <class 'numpy.ndarray'>

m = X.values
m.dtype.names = list(X.columns)

I get

ValueError: there are no fields defined

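UPDATE:

I'm particularly interested in the case where the matrix only needs to hold a single type (it is an ndarray of one specific numeric type), since I'd also like to use cython for optimization. (I suspect numpy records and structured arrays are harder to deal with, since they are more loosely typed.)

Really, I'd just like to keep the column_name metadata for arrays passed through a deep tree of sci-kit predictors. Its .fit(X, y) and .predict(X) APIs don't permit passing additional metadata about the column labels outside of the X and y objects.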

Answer 1

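Yet more methods of converting a pandas.DataFrame to a numpy.array while preserving label/column names

This is mainly to demonstrate how to set dtype/column_dtypes, because sometimes the output of a data source iterator needs some pre-normalization.

Method one inserts column by column into a zeroed array of predefined height. It is loosely based on a Creating Structured Arrays guide that a bit of web crawling turned up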

import numpy


def to_tensor(dataframe, columns=None, dtypes=None):
    # `None` defaults avoid sharing mutable default arguments between calls
    dtypes = dtypes or {}
    # Use all columns from data frame if none were listed when called
    if columns is None or len(columns) == 0:
        columns = dataframe.columns
    columns = list(columns)  # numpy expects a plain list of field names
    # Build list of dtypes to use, updating from any `dtypes` passed when called
    dtype_list = []
    for column in columns:
        if column not in dtypes:
            dtype_list.append(dataframe[column].dtype)
        else:
            dtype_list.append(dtypes[column])
    # Build dictionary with lists of column names and formatting in the same order
    dtype_dict = {
        'names': columns,
        'formats': dtype_list
    }
    # Initialize zeroed numpy array with column names and formatting
    numpy_buffer = numpy.zeros(
        shape=len(dataframe),
        dtype=dtype_dict)
    # Insert values from dataframe columns into the matching numpy fields
    for column in columns:
        numpy_buffer[column] = dataframe[column].to_numpy()
    # Return results of conversion
    return numpy_buffer

Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper for the built-in to_records method available on pandas.DataFrame

def to_tensor(dataframe, columns=None, dtypes=None, index=False):
    to_records_kwargs = {'index': index}
    if columns is None or len(columns) == 0:  # Default to all `dataframe.columns`
        columns = dataframe.columns
    columns = list(columns)
    if dtypes:  # Pull in modifications only for dtypes listed in `columns`
        to_records_kwargs['column_dtypes'] = {
            column: dtype for column, dtype in dtypes.items() if column in columns
        }
    return dataframe[columns].to_records(**to_records_kwargs)

With either of the above one could do...

import pandas

X = pandas.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))

# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes={'age': 'int32'})

print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))

...which should output...

Ages -> [40 50 60]
SBPs -> [140. 150. 160.]

...and a full dump of X_tensor should look like the following.

array([(40, 140.), (50, 150.), (60, 160.)],
      dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])

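Some thoughts

While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array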
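To make that last remark concrete, here is a rough sketch of what such a modification could look like. The helper name frames_to_structured and the glucose column are made up for illustration, and the sketch assumes every frame has the same number of rows and that no column name repeats.

```python
import numpy
import pandas


def frames_to_structured(*dataframes):
    """Copy the columns of several equal-length DataFrames into one structured array."""
    if not dataframes:
        raise ValueError("at least one DataFrame is required")
    height = len(dataframes[0])
    names, formats = [], []
    for frame in dataframes:
        if len(frame) != height:
            raise ValueError("all DataFrames must have the same number of rows")
        for column in frame.columns:
            if column in names:
                raise ValueError("duplicate column name: {0}".format(column))
            names.append(column)
            formats.append(frame[column].dtype)
    # Zeroed buffer with one field per column, as in method one
    merged = numpy.zeros(height, dtype={'names': names, 'formats': formats})
    for frame in dataframes:
        for column in frame.columns:
            merged[column] = frame[column].to_numpy()
    return merged


# Hypothetical example data
vitals = pandas.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140., 150., 160.]))
labs = pandas.DataFrame(dict(glucose=[90, 105, 120]))
merged = frames_to_structured(vitals, labs)
print(merged.dtype.names)
```

Rows are matched purely by position here; if the frames carry different indexes, aligning them first (for example with a pandas join) would be the safer route.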

Answer 2

Consider a DF as shown below:

import numpy as np
import pandas as pd
X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1, 2, 3]))
X

Provide a list of tuples as the data input to the structured array:

arr_ip = [tuple(i) for i in X.to_numpy()]

Ordered list of field names:

dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))

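Here, X.dtypes.index gives you the column names and X.dtypes the corresponding dtypes. The two are zipped into a list of tuples, which is fed as input to the dtype being constructed.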

arr = np.array(arr_ip, dtype=dtyp)

yields:

arr
# array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)], 
#       dtype=[('one', 'O'), ('two', '<i8')])

and

arr.dtype.names
# ('one', 'two')

Answer 3

Pandas dataframes also have a handy to_records method. Demo:

import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
m = X.to_records(index=False)
print(repr(m))

Returns:

rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)], 
          dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])

This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].
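As a quick illustration of the two access styles, the following sketch rebuilds the same frame and compares them. It is only meant to show that both spellings reach the same field.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(dict(age=[40., 50., 60.],
                      sys_blood_pressure=[140., 150., 160.]))
m = X.to_records(index=False)

# Field access by key and by attribute refer to the same data
print(m['age'])
print(m.age)

# The column names survive as field names
print(m.dtype.names)

# A record array is still an ndarray
print(isinstance(m, np.ndarray))
```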

You can pass this to a cython function as a regular float array by constructing a view:

m_float = m.view(float).reshape(m.shape + (-1,))
print(repr(m_float))

This gives:

rec.array([[  40.,  140.],
           [  50.,  150.],
           [  60.,  160.]], 
          dtype=float64)

Note that in order for this to work, the original Dataframe must have a float dtype for every column. To make sure, use m = X.astype(float, copy=False).to_records(index=False).
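Here is a small sketch of that precaution, using a made-up frame where one column starts out as integers. It casts everything to float first, keeps the field names on the side, and then builds the 2-D view.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer column and one float column
X = pd.DataFrame(dict(age=[40, 50, 60],
                      sys_blood_pressure=[140., 150., 160.]))

# Cast every column to float first, so all fields share one dtype
m = X.astype(float).to_records(index=False)

# Keep the column names separately; the plain float view has no fields
names = m.dtype.names

m_float = m.view(float).reshape(m.shape + (-1,))
print(names)
print(m_float.shape)
```

Without the cast, the fields would not all be float64, so reinterpreting the raw record bytes as float could not be relied on to give back the original values.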

Answer 4

OK, here is where I'm leaning:

import numpy as np
import pandas as pd


class NDArrayWithColumns(np.ndarray):
    def __new__(cls, obj, columns=None):
        # View the input as this subclass and attach the column labels
        obj = obj.view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        # Called for views and new-from-template arrays; carry the labels over
        if obj is None:
            return
        self.columns = getattr(obj, 'columns', None)

    @staticmethod
    def from_dataframe(df):
        cols = tuple(df.columns)
        # df.as_matrix(cols) was removed in pandas 1.0; select columns, then convert
        arr = df[list(cols)].to_numpy()
        return NDArrayWithColumns.from_array(arr, cols)

    @staticmethod
    def from_array(array, columns):
        if isinstance(array, NDArrayWithColumns):
            return array
        return NDArrayWithColumns(array, tuple(columns))

    def __str__(self):
        sup = np.ndarray.__str__(self)
        if self.columns:
            header = ", ".join(self.columns)
            header = "# " + header + "\n"
            return header + sup
        return sup


NAN = float("nan")
X = pd.DataFrame(dict(age=[40., NAN, 60.], sys_blood_pressure=[140., 150., 160.]))
arr = NDArrayWithColumns.from_dataframe(X)
print(arr)
print(arr.columns)
print(arr.dtype)

yields:

# age, sys_blood_pressure
[[  40.  140.]
 [  nan  150.]
 [  60.  160.]]
('age', 'sys_blood_pressure')
float64

and can also be passed to a typed cython function expecting an ndarray[2, double_t].

UPDATE: this works pretty well, except for some oddness when passing the type to ufuncs.
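One thing worth checking before relying on a subclass like this is how the columns attribute travels through derived arrays. The sketch below is not necessarily the oddness referred to above (that discussion isn't reproduced here). It uses a trimmed copy of the class to show that __array_finalize__ copies the labels wholesale, so a column slice still reports every original name, and that converting to a base ndarray drops them.

```python
import numpy as np


class NDArrayWithColumns(np.ndarray):
    # Trimmed copy of the class above, kept only to make the sketch self-contained
    def __new__(cls, obj, columns=None):
        obj = np.asarray(obj).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.columns = getattr(obj, 'columns', None)


arr = NDArrayWithColumns(np.array([[40., 140.], [50., 150.]]),
                         columns=('age', 'sys_blood_pressure'))

# Slicing out one column still carries both labels, so they no longer match the data
first = arr[:, :1]
print(first.shape, first.columns)

# Converting to a base ndarray loses the attribute entirely
plain = np.asarray(arr)
print(hasattr(plain, 'columns'))

# Inspect what an elementwise ufunc result carries
doubled = arr * 2
print(type(doubled).__name__, getattr(doubled, 'columns', None))
```

If the labels need to stay in sync with column slices, overriding __getitem__ to trim columns accordingly would be one option, at the cost of more code to maintain.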
