How to filter a pandas DataFrame when the column's data type is a list

Question

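I'm having trouble filtering a pandas DataFrame on a column (call it column_1) whose values are lists. Specifically, I want to return only the rows where the intersection of column_1 and another predefined list is non-empty. However, whenever I try to put this logic inside the argument of the .where function, I get an error. Below are my attempts and the errors they return.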

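Trying to test whether a single element is in the list: table[element in table['column_1']] returns the error ... KeyError: False
Trying to compare a list against every list in the DataFrame's rows: table[[349569] == table.column_1] returns the error Arrays were different lengths: 23041 vs 1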

I'm trying to get these two intermediate steps working before I move on to testing the intersection of two lists.

Thanks for taking the time to read my question!

Answer 1

Consider the pd.Series s

import pandas as pd

s = pd.Series([[1, 2, 3], list('abcd'), [9, 8, 3], ['a', 4]])
print(s)

0       [1, 2, 3]
1    [a, b, c, d]
2       [9, 8, 3]
3          [a, 4]
dtype: object

and the test list test

test = ['b', 3, 4]

Apply a lambda function that converts each element of s to a set and takes its intersection with test

print(s.apply(lambda x: list(set(x).intersection(test))))

0    [3]
1    [b]
2    [3]
3    [4]
dtype: object

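To use the result as a mask, wrap it in bool instead of list: an empty intersection becomes False and a non-empty one becomes True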

s.apply(lambda x: bool(set(x).intersection(test)))

0    True
1    True
2    True
3    True
dtype: bool
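The two failed attempts in the question can be handled the same way. As far as I can tell, element in table['column_1'] checks the Series index labels rather than the lists, so it yields a single False and table[False] raises KeyError: False; and comparing a one-item list with a 23041-row column is treated as an element-wise comparison of mismatched lengths. Below is a minimal sketch of both intermediate steps plus the final filter, assuming a small made-up table (only the names table and column_1 come from the question; the data, the other column, and the wanted list are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for the asker's table; the values are made up.
table = pd.DataFrame({
    "column_1": [[349569], [1, 2], [349569, 7], []],
    "other": ["a", "b", "c", "d"],
})

element = 349569
wanted = [349569, 2]  # hypothetical predefined list

# Step 1: rows whose list contains a single element.
has_element = table["column_1"].apply(lambda lst: element in lst)

# Step 2: rows whose list is exactly equal to a given list.
equals_list = table["column_1"].apply(lambda lst: lst == [349569])

# Goal: rows whose list shares at least one item with `wanted`.
overlaps = table["column_1"].apply(lambda lst: bool(set(lst).intersection(wanted)))

# A boolean Series can be used directly to select rows.
filtered = table[overlaps]
print(filtered)
```

Each apply call returns one boolean per row, which is what table[...] needs for row selection.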

Answer 2

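Hi, for long-term use you can wrap the whole workflow in functions and apply them wherever you need. Since you didn't provide a sample dataset, I'm using one of my own and working through it. Suppose I have a database of text posts. First I collect the #tags of each post into a list, then I search for only the #tags I want and filter the data.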

import re

import pandas as pd


# find all the hashtags in the message
def find_hashtags(post_msg):
    return re.findall(r'#\w+', post_msg)


# return True if the post's hashtags share at least one tag with the search list
def match_tags(tag_list, htag_list):
    return bool(set(tag_list).intersection(htag_list))


test_data = [{'text': 'Head nipid mõnusateks sõitudeks kitsastel tänavatel. #TipStop'},
 {'text': 'Homses Rooli Võimus uus #Peugeot208!\nVaata kindlasti.'},
 {'text': 'Soovitame ennast tulevikuks ette valmistada, electric car sest uus #PeugeotE208 on peagi kohal!  ⚡️⚡️\n#UnboringTheFuture'},
 {'text': "Aeg on täiesti uueks roadtrip'i kogemuseks! \nLase ennast üllatada - #Peugeot5008!"},
 {'text': 'Tõeline ikoon, mille stiil avaldab muljet läbi eco car, electric cars generatsioonide #Peugeot504!'}
]

test_df = pd.DataFrame(test_data)

# find all the hashtags
test_df["hashtags"] = test_df["text"].apply(find_hashtags)

# the only hashtags we are interested in
tag_search = ["#TipStop", "#Peugeot208"]

# match the tags in our list
test_df["tag_exist"] = test_df["hashtags"].apply(lambda x: match_tags(x, tag_search))

# filter the data
main_df = test_df[test_df.tag_exist]
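As an alternative to the per-row match_tags call, the same mask can be sketched with Series.explode and isin. This is a different technique from the one in the answers, shown only for comparison. It assumes pandas 0.25 or newer (where explode is available), and the hashtags column is hard-coded here to what the regex above extracts from the five sample posts.

```python
import pandas as pd

# Hashtag lists as extracted from the sample posts above (hard-coded for brevity).
test_df = pd.DataFrame({
    "hashtags": [
        ["#TipStop"],
        ["#Peugeot208"],
        ["#PeugeotE208", "#UnboringTheFuture"],
        ["#Peugeot5008"],
        ["#Peugeot504"],
    ]
})

tag_search = ["#TipStop", "#Peugeot208"]

# explode() gives one row per tag and repeats the original index label;
# isin() tests each tag; groupby(level=0).any() folds back to one bool per post.
mask = test_df["hashtags"].explode().isin(tag_search).groupby(level=0).any()

main_df = test_df[mask]
print(main_df)
```

With these sample posts only the first two rows survive, since they are the only ones carrying #TipStop or #Peugeot208. Whether this is faster than apply depends on the data, so it is worth timing on a real table before switching.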
