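The strings in the Genres column are lists of tags. To make that data usable, I suggest converting them into factors, i.e. creating a separate column for each tag that indicates which rows the tag applies to. You can do it like this: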
import pandas as pd
# small subset of your data for demonstration
df = pd.DataFrame({'Name': ['Sudoku', 'Reversi', 'Morocco'],
                   'Genres': ['Games, Strategy, Puzzle',
                              'Games, Strategy, Board',
                              'Games, Board, Strategy']})
display(df)
      Name                   Genres
0   Sudoku  Games, Strategy, Puzzle
1  Reversi   Games, Strategy, Board
2  Morocco   Games, Board, Strategy
# split each of the strings into a list
# (split on ', ' rather than ',' so the tags carry no leading space)
df['Genres'] = df['Genres'].str.split(pat=', ')
# collect all unique tags from those lists, sorted for a stable column order
tags = sorted(set(df['Genres'].explode().values))
# create a new Boolean column for each tag
for tag in tags:
    df[tag] = [tag in df['Genres'].loc[i] for i in df.index]
display(df)
      Name                     Genres  Board  Games  Puzzle  Strategy
0   Sudoku  [Games, Strategy, Puzzle]  False   True    True      True
1  Reversi   [Games, Strategy, Board]   True   True   False      True
2  Morocco   [Games, Board, Strategy]   True   True   False      True
Note that this code is not optimized for speed. I just wanted to show how it can be done.
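If speed does matter, pandas has a built-in that may do the same job in one step: Series.str.get_dummies. The following is only a minimal sketch, not the method used above. It assumes the tags are always separated by exactly ', ' (comma plus space), and the names dummies and result are made up for illustration.

```python
import pandas as pd

# same small demonstration subset as above
df = pd.DataFrame({'Name': ['Sudoku', 'Reversi', 'Morocco'],
                   'Genres': ['Games, Strategy, Puzzle',
                              'Games, Strategy, Board',
                              'Games, Board, Strategy']})

# one indicator column per tag; assumes ', ' is the only separator in use
dummies = df['Genres'].str.get_dummies(sep=', ').astype(bool)

# attach the indicator columns to the original frame
result = df.join(dummies)
print(result)
```

Unlike the loop, this leaves the Genres column as the original string, so nothing needs to be split beforehand.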
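You can do a
df['Genres'].str.split(",", n=1, expand=True)
Set n to whatever value suits you: each string is split at no more than n commas, which gives n + 1 columns, and you can then select the columns you need.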
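To make the effect of n concrete, here is a small sketch on the same sample data as the first answer. It assumes the column is called Genres and splits on ', ' instead of ',' so that the pieces carry no leading space. The names parts and first_genre are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Sudoku', 'Reversi', 'Morocco'],
                   'Genres': ['Games, Strategy, Puzzle',
                              'Games, Strategy, Board',
                              'Games, Board, Strategy']})

# n=1 allows at most one split per string, so expand=True yields two columns:
# column 0 holds the first tag, column 1 holds the unsplit remainder
parts = df['Genres'].str.split(', ', n=1, expand=True)
print(parts)

# pick the column you need, here the first tag of every row
first_genre = parts[0]
```

With a larger n you get more columns, and rows that have fewer tags are padded with missing values.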