I'd like to know whether pd.melt supports melting several groups of columns at once. In the example below I tried passing value_vars as a list of lists, but I get this error:
ValueError: Location based indexing can only have [labels (MUST BE IN THE INDEX), slices of labels (BOTH endpoints included! Can be slices of integers if the index is integers), listlike of labels, boolean] types
I'm using pandas 0.23.1.
import pandas as pd

df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover'],
                   'State': ['Texas', 'Texas', 'Alabama'],
                   'Name': ['Aria', 'Penelope', 'Niko'],
                   'Mango': [4, 10, 90],
                   'Orange': [10, 8, 14],
                   'Watermelon': [40, 99, 43],
                   'Gin': [16, 200, 34],
                   'Vodka': [20, 33, 18]},
                  columns=['City', 'State', 'Name', 'Mango', 'Orange', 'Watermelon', 'Gin', 'Vodka'])
Desired output:
      City    State       Fruit  Pounds  Drink  Ounces
0  Houston    Texas       Mango       4    Gin    16.0
1   Austin    Texas       Mango      10    Gin   200.0
2   Hoover  Alabama       Mango      90    Gin    34.0
3  Houston    Texas      Orange      10  Vodka    20.0
4   Austin    Texas      Orange       8  Vodka    33.0
5   Hoover  Alabama      Orange      14  Vodka    18.0
6  Houston    Texas  Watermelon      40    NaN     NaN
7   Austin    Texas  Watermelon      99    NaN     NaN
8   Hoover  Alabama  Watermelon      43    NaN     NaN
I tried the following, but got the error above:
df.melt(id_vars=['City', 'State'],
        value_vars=[['Mango', 'Orange', 'Watermelon'], ['Gin', 'Vodka']],
        var_name=['Fruit', 'Drink'],
        value_name=['Pounds', 'Ounces'])
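For context, here is a small sketch of what melt does accept, assuming a reasonably recent pandas: value_vars takes a flat list of column labels, and var_name / value_name are single names. Melting all five columns in one call therefore gives one shared variable/value pair rather than separate Fruit and Drink pairs:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "City": ["Houston", "Austin", "Hoover"],
        "State": ["Texas", "Texas", "Alabama"],
        "Mango": [4, 10, 90],
        "Orange": [10, 8, 14],
        "Watermelon": [40, 99, 43],
        "Gin": [16, 200, 34],
        "Vodka": [20, 33, 18],
    }
)

# value_vars as a flat list: every listed column lands in the same
# "variable"/"value" pair, so fruits and drinks end up mixed together.
flat = df.melt(
    id_vars=["City", "State"],
    value_vars=["Mango", "Orange", "Watermelon", "Gin", "Vodka"],
)
print(flat.shape)
print(flat.columns.tolist())
```

That is why getting two separate name/value pairs takes more than a single melt call.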
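Call melt once per category and combine the results with concat. Because the City/State values are duplicated after melting, add a cumcount counter so that the triples in the MultiIndex are unique: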
df1 = df.melt(id_vars=['City', 'State'],
              value_vars=['Mango', 'Orange', 'Watermelon'],
              var_name='Fruit', value_name='Pounds')
df2 = df.melt(id_vars=['City', 'State'],
              value_vars=['Gin', 'Vodka'],
              var_name='Drink', value_name='Ounces')
df1 = df1.set_index(['City', 'State', df1.groupby(['City', 'State']).cumcount()])
df2 = df2.set_index(['City', 'State', df2.groupby(['City', 'State']).cumcount()])
df3 = (pd.concat([df1, df2], axis=1)
         .sort_index(level=2)
         .reset_index(level=2, drop=True)
         .reset_index())
print(df3)
      City    State       Fruit  Pounds  Drink  Ounces
0   Austin    Texas       Mango      10    Gin   200.0
1   Hoover  Alabama       Mango      90    Gin    34.0
2  Houston    Texas       Mango       4    Gin    16.0
3   Austin    Texas      Orange       8  Vodka    33.0
4   Hoover  Alabama      Orange      14  Vodka    18.0
5  Houston    Texas      Orange      10  Vodka    20.0
6   Austin    Texas  Watermelon      99    NaN     NaN
7   Hoover  Alabama  Watermelon      43    NaN     NaN
8  Houston    Texas  Watermelon      40    NaN     NaN
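Note that sort_index also orders the rows by City and State, so the result above is not in the original row order. If that order matters, one possible variant is to align the second frame to the first with reindex instead of sorting. This is only a sketch, and it assumes the first melt is the one with the most value columns, so that its index covers every triple of the second:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "City": ["Houston", "Austin", "Hoover"],
        "State": ["Texas", "Texas", "Alabama"],
        "Mango": [4, 10, 90],
        "Orange": [10, 8, 14],
        "Watermelon": [40, 99, 43],
        "Gin": [16, 200, 34],
        "Vodka": [20, 33, 18],
    }
)
ids = ["City", "State"]

df1 = df.melt(id_vars=ids, value_vars=["Mango", "Orange", "Watermelon"],
              var_name="Fruit", value_name="Pounds")
df2 = df.melt(id_vars=ids, value_vars=["Gin", "Vodka"],
              var_name="Drink", value_name="Ounces")

# Same cumcount trick: the counter makes each (City, State, n) triple unique.
df1 = df1.set_index(ids + [df1.groupby(ids).cumcount()])
df2 = df2.set_index(ids + [df2.groupby(ids).cumcount()])

# Align df2 to df1's index (missing triples become NaN) rather than sorting,
# so the rows keep the order produced by the first melt.
df3 = (
    pd.concat([df1, df2.reindex(df1.index)], axis=1)
    .reset_index(level=2, drop=True)
    .reset_index()
)
print(df3)
```

The rows then come out as Houston, Austin, Hoover within each fruit, matching the order of the input frame.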
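I ran into this same problem again today, found this question, saw that I had already upvoted it, and figured it would be worth turning into something reusable, since this keeps coming up for me.
So I wrote a multi_melt function that uses the approach proposed by jezrael but works with iterable inputs (the syntax Martin Petrov used). Note that this version "broadcasts" scalar inputs: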
from itertools import cycle

import pandas as pd


def is_scalar(obj):
    # Strings are iterable, but they are treated as scalars here.
    if isinstance(obj, str):
        return True
    elif hasattr(obj, "__iter__"):
        return False
    else:
        return True


def multi_melt(
    df: pd.DataFrame,
    id_vars=None,
    value_vars=None,
    var_name=None,
    value_name="value",
    col_level=None,
    ignore_index=True,
) -> pd.DataFrame:
    # Note: we don't broadcast value_vars ... that would seem unintuitive
    value_vars = value_vars if not is_scalar(value_vars[0]) else [value_vars]
    var_name = var_name if not is_scalar(var_name) else cycle([var_name])
    value_name = value_name if not is_scalar(value_name) else cycle([value_name])

    # The cumcount counter is appended after the id_vars, so this is its
    # position in the MultiIndex (instead of a hard-coded level=2).
    counter_level = len(id_vars)

    melted_dfs = [
        (
            df.melt(
                id_vars,
                *melt_args,
                col_level,
                ignore_index,
            ).pipe(lambda df: df.set_index([*id_vars, df.groupby(id_vars).cumcount()]))
        )
        for melt_args in zip(value_vars, var_name, value_name)
    ]

    return (
        pd.concat(melted_dfs, axis=1)
        .sort_index(level=counter_level)
        .reset_index(level=counter_level, drop=True)
        .reset_index()
    )
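Since it isn't part of the pandas API, you have to pipe it (or call multi_melt(df, ...) directly), but otherwise it should work like a regular melt that accepts iterables.
Example: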
df = pd.DataFrame(
    {
        "City": ["Houston", "Austin", "Hoover"],
        "State": ["Texas", "Texas", "Alabama"],
        "Name": ["Aria", "Penelope", "Niko"],
        "Mango": [4, 10, 90],
        "Orange": [10, 8, 14],
        "Watermelon": [40, 99, 43],
        "Gin": [16, 200, 34],
        "Vodka": [20, 33, 18],
    },
    columns=["City", "State", "Name", "Mango", "Orange", "Watermelon", "Gin", "Vodka"],
)
df.pipe(
    multi_melt,
    id_vars=["City", "State"],
    value_vars=[["Mango", "Orange", "Watermelon"], ["Gin", "Vodka"]],
    var_name=["Fruit", "Drink"],
    value_name=["Pounds", "Ounces"],
)
Result:
      City    State       Fruit  Pounds  Drink  Ounces
0   Austin    Texas       Mango      10    Gin   200.0
1   Hoover  Alabama       Mango      90    Gin    34.0
2  Houston    Texas       Mango       4    Gin    16.0
3   Austin    Texas      Orange       8  Vodka    33.0
4   Hoover  Alabama      Orange      14  Vodka    18.0
5  Houston    Texas      Orange      10  Vodka    20.0
6   Austin    Texas  Watermelon      99    NaN     NaN
7   Hoover  Alabama  Watermelon      43    NaN     NaN
8  Houston    Texas  Watermelon      40    NaN     NaN
Single melt:
df.pipe(
    multi_melt,
    id_vars=["City", "State"],
    value_vars=["Mango", "Orange", "Watermelon"],
    var_name="Fruit",
    value_name="Pounds",
)
      City    State       Fruit  Pounds
0   Austin    Texas       Mango      10
1   Hoover  Alabama       Mango      90
2  Houston    Texas       Mango       4
3   Austin    Texas      Orange       8
4   Hoover  Alabama      Orange      14
5  Houston    Texas      Orange      10
6   Austin    Texas  Watermelon      99
7   Hoover  Alabama  Watermelon      43
8  Houston    Texas  Watermelon      40
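The scalar "broadcasting" in multi_melt comes from itertools.cycle combined with zip, which stops at its shortest input. A small standalone illustration follows; the name "Amount" is just a placeholder for this sketch, not something used elsewhere in this thread:

```python
from itertools import cycle

value_vars = [["Mango", "Orange", "Watermelon"], ["Gin", "Vodka"]]
var_name = ["Fruit", "Drink"]
value_name = cycle(["Amount"])  # a scalar wrapped in an endless iterator

# zip stops once value_vars and var_name are exhausted, so the cycled
# scalar is reused exactly once per group of columns.
melt_args = list(zip(value_vars, var_name, value_name))
for args in melt_args:
    print(args)
```

One thing to keep in mind: broadcasting the same value_name to several groups would leave the concatenated frame with repeated column names, so passing distinct names per group is probably the more useful pattern.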
An efficient option is pivot_longer from pyjanitor, with a list of regular expressions, relying on the existing order of the columns (Mango, Gin; Orange, Vodka; Watermelon):
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index=["City", "State"],
    column_names=slice("Mango", "Vodka"),
    names_to=("Fruit", "Drink"),
    values_to=("Pounds", "Ounces"),
    names_pattern=[r"M|O|W", r"G|V"],
)
      City    State       Fruit  Pounds  Drink  Ounces
0  Houston    Texas       Mango       4    Gin    16.0
1   Austin    Texas       Mango      10    Gin   200.0
2   Hoover  Alabama       Mango      90    Gin    34.0
3  Houston    Texas      Orange      10  Vodka    20.0
4   Austin    Texas      Orange       8  Vodka    33.0
5   Hoover  Alabama      Orange      14  Vodka    18.0
6  Houston    Texas  Watermelon      40    NaN     NaN
7   Austin    Texas  Watermelon      99    NaN     NaN
8   Hoover  Alabama  Watermelon      43    NaN     NaN
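To see which columns each entry of names_pattern picks up, the patterns can be tried with the standard re module. This only illustrates the regexes themselves; it assumes, as far as I understand it, that pivot_longer searches for each pattern anywhere in the column name, and the grouping loop is my own sketch rather than pyjanitor's code:

```python
import re

columns = ["Mango", "Orange", "Watermelon", "Gin", "Vodka"]
patterns = {"Fruit": r"M|O|W", "Drink": r"G|V"}

# Assign each column to the first pattern that is found somewhere in its name.
groups = {name: [] for name in patterns}
for col in columns:
    for name, pattern in patterns.items():
        if re.search(pattern, col):
            groups[name].append(col)
            break

print(groups)
```

The single-letter patterns are case-sensitive and very short, so with a different set of column names more specific patterns may be safer.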