One-hot encoding for age-category data

Question (votes: 0, answers: 1)

When I try to encode the following categories with a one-hot encoder, I get a "could not convert string to float" error.

['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']
python-3.x sklearn-pandas one-hot-encoding
1 answer (0 votes)

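I threw together something quick that should work. You'll see that I have a rather ugly one-liner for preprocessing your limits; it would be much easier if you simply converted the limits to the right format up front.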

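Essentially, this just iterates over the sorted list of limits and compares each sample against them. As soon as the sample is less than or equal to a limit, we set that index to 1 and break.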

import random

# str_limits = ['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']
#
# one-liner that derives integer upper bounds from the string labels
# limits = sorted(int(s.split("-")[-1]) for s in str_limits if not s.endswith("+"))
# limits.append(1000)  # catch-all upper bound for the open-ended '55+' bracket

# do this instead
limits = sorted([17, 35, 50, 55, 45, 25, 1000])

# sample 100 random ages from 0 to 64 for testing
samples = [random.randrange(65) for _ in range(100)]

onehot = []  # this is where we will store our one-hot encodings
for sample in samples:
    row = [0]*len(limits)  # preallocating a list
    for i, limit in enumerate(limits):
        if sample <= limit:
            row[i] = 1
            break

    # storing that sample's onehot into a onehot list of lists
    onehot.append(row)

for i in range(10):
    print("{}: {}".format(onehot[i], samples[i]))
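The inner loop above is a linear search for the first limit that is greater than or equal to the sample. As a minimal sketch, the same lookup can be done with the standard-library `bisect` module. The helper name `one_hot_age` is hypothetical, and the sketch assumes every age is at most the largest limit (1000 here).

```python
from bisect import bisect_left

# Upper bound of each age bracket, sorted ascending; 1000 stands in for "55+".
limits = [17, 25, 35, 45, 50, 55, 1000]


def one_hot_age(age, limits):
    """Return a one-hot list marking the first bracket whose upper bound is >= age.

    Hypothetical helper; assumes age <= limits[-1].
    """
    row = [0] * len(limits)
    # bisect_left returns the index of the first limit >= age,
    # which matches the `sample <= limit` test in the loop above.
    row[bisect_left(limits, age)] = 1
    return row


for age in (17, 18, 60):
    print("{}: {}".format(one_hot_age(age, limits), age))
```

Unlike the loop, which leaves the row all zeros for an age above every limit, this version raises an `IndexError` in that case, so the catch-all upper bound matters.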

I'm not sure about the specifics of your implementation, but you probably forgot to convert from strings to integers at some point.
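If the column already holds the bracket labels as strings, as in the question, another option is to skip the numeric comparison and map each label to a column index yourself. The following is only a sketch under that assumption; `lower_bound` and `one_hot_label` are hypothetical helper names, and the columns are ordered by each bracket's lower bound.

```python
def lower_bound(label):
    """Numeric lower bound of a label such as '26-35' or '55+'."""
    return int(label.rstrip("+").split("-")[0])


str_limits = ['0-17', '55+', '26-35', '46-50', '51-55', '36-45', '18-25']

# Order the brackets by age so that column i corresponds to the i-th bracket.
categories = sorted(str_limits, key=lower_bound)
index_of = {label: i for i, label in enumerate(categories)}


def one_hot_label(label):
    """Return a one-hot list for a bracket label; raises KeyError if unknown."""
    row = [0] * len(categories)
    row[index_of[label]] = 1
    return row


print(categories)
print(one_hot_label('26-35'))
```

Because the labels are only used as dictionary keys, nothing here ever tries to turn a string such as '55+' into a float.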
