How to preserve a categorical dtype in a pandas MultiIndex when concatenating dataframes?

Question

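I am working with data (chromosome names in my actual use case, but I use dummy names here) whose sort order I want to control, and which will be part of a MultiIndex. The MultiIndex also contains positions within the chromosomes: I want to sort my data by chromosome, then by position.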

To ensure the desired sort order, it seems possible to use a Categorical in the index. However, as soon as I concatenate dataframes, that dtype is lost from the MultiIndex.

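(In the example below, "A" plays the role of my chromosome information and "B" plays the role of the position information. "C" is some unique locus identifier.)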

import pandas as pd

df1 = pd.DataFrame({
    "A": pd.Categorical(
        ["X9", "X9", "X10", "X10"],
        categories=["X8", "X9", "X10"], ordered=True),
    "B": [1, 2, 1, 2],
    "C": ["9_1", "9_2", "10_1", "10_2"],
    "1": [1, 2, 3, 4]}
).set_index(["A", "B", "C"])
print(df1.index.dtypes)

df2 = pd.DataFrame({
    "A": pd.Categorical(
        ["X8", "X8", "X10", "X10"],
        categories=["X8", "X9", "X10"], ordered=True),
    "B": [1, 2, 1, 2],
    "C": ["8_1", "8_2", "10_1", "10_2"],
    "2": [1, 2, 3, 4]}
).set_index(["A", "B", "C"])
print(df2.index.dtypes)

df = pd.concat([df1, df2], axis=1).sort_index()
print(df.index.dtypes)
print(df.to_string())

The code above produces the following output:

A    category
B       int64
C      object
dtype: object
A    category
B       int64
C      object
dtype: object
A    object
B     int64
C    object
dtype: object
              1    2
A   B C             
X10 1 10_1  3.0  3.0
    2 10_2  4.0  4.0
X8  1 8_1   NaN  1.0
    2 8_2   NaN  2.0
X9  1 9_1   1.0  NaN
    2 9_2   2.0  NaN

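We can see that the concatenated dataframe, once sorted by index, is ordered alphabetically on level "A". This is consistent with the dtype no longer being categorical, but I want "8" and "9" to appear before "10". I cannot simply drop the "X" and convert these names to integers (remember, these are supposed to be chromosome names, and for humans we can have chromosomes "X" and "Y").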

How can I preserve the index dtypes when concatenating?
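For reference, one possible workaround is sketched below. It does not stop the concatenation from dropping the dtype; it assumes it is acceptable to re-cast level "A" afterwards and then sort. The helper name restore_categorical_level and the shared chrom_dtype object are made up for this illustration and are not part of pandas.

```python
import pandas as pd

# Shared ordered dtype (illustrative name), so both frames and the
# concatenated result use the same category order.
chrom_dtype = pd.CategoricalDtype(categories=["X8", "X9", "X10"], ordered=True)

df1 = pd.DataFrame({
    "A": pd.Categorical(["X9", "X9", "X10", "X10"], dtype=chrom_dtype),
    "B": [1, 2, 1, 2],
    "C": ["9_1", "9_2", "10_1", "10_2"],
    "1": [1, 2, 3, 4],
}).set_index(["A", "B", "C"])

df2 = pd.DataFrame({
    "A": pd.Categorical(["X8", "X8", "X10", "X10"], dtype=chrom_dtype),
    "B": [1, 2, 1, 2],
    "C": ["8_1", "8_2", "10_1", "10_2"],
    "2": [1, 2, 3, 4],
}).set_index(["A", "B", "C"])


def restore_categorical_level(frame, level, dtype):
    """Hypothetical helper: re-cast one index level to `dtype`.

    Moves the index levels into columns, casts the chosen column,
    and rebuilds the MultiIndex with the original level names.
    """
    names = list(frame.index.names)
    flat = frame.reset_index()
    flat[level] = flat[level].astype(dtype)
    return flat.set_index(names)


combined = pd.concat([df1, df2], axis=1)
df = restore_categorical_level(combined, "A", chrom_dtype).sort_index()

print(df.index.dtypes)
print(df.to_string())
```

With level "A" categorical again, sort_index should follow the declared category order (X8, X9, X10) rather than alphabetical order. The cast is applied whether or not the installed pandas version keeps the dtype through the concatenation.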
