I can't remove an exact phrase from one pandas DataFrame column based on another column


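I'm trying to separate two team names that are joined together in each cell of a column. I'd appreciate some help coming up with a way to split them apart.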

As you can see from the code below, I'm importing data from a website and cleaning up the DataFrame.

What I want to achieve is a new column: the df_games_2023['text_only_away'] column with the contents of df_games_2023['text_only'] stripped from it. So the new column, df_games_2023['new_text_column'], would be "Mississippi Valley St.", "Brescia", "Pacific", etc.

import pandas as pd
import re

# URL of the CSV file on the website
url = "https://www.barttorvik.com/2023_results.csv"  # Replace with the actual URL

# Name the columns
colnames = ['matchup', 'date','home_team', 'xyz', 'xyz1','away_score','home_score','xyz2','xyz3','xyz4','xyz5']

# Read CSV data into pandas DataFrame
df_games_2023 = pd.read_csv(url, names = colnames)

#eliminate columns from dataframe
df_games_2023 = df_games_2023[['matchup','date', 'home_team', 'away_score','home_score']]

#name the dataframe columns
#df_games_2023.columns = ['matchup', 'date','home_team', 'xyz', 'xyz1','away_score','home_score','xyz2','xyz3','xyz4','xyz5']

#clean up the home_team data
# Extract only text using regex
df_games_2023['text_only'] = df_games_2023['home_team'].apply(lambda x: re.sub(r'\d+', '', x))

# Define the phrases to drop
phrases_to_drop = [',','-','.','(',')','%']

# Drop the specified phrases from the column
for phrase in phrases_to_drop:
    df_games_2023['text_only'] = df_games_2023['text_only'].str.replace(phrase, '', regex=True)
    
#Clean up away team
df_games_2023['text_only_away'] = df_games_2023['matchup'].apply(lambda x: re.sub(r'\d+', '', x))
#we are removing a random '-' with this string of code
df_games_2023['text_only_away'] = df_games_2023['text_only_away'].apply(lambda x: x.rstrip('-'))

# Now you have your DataFrame ready
df_games_2023

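The code above works fine; the problem comes when I try to isolate a single team name from the df_games_2023['text_only_away'] column. Here is the code I'm using to create the new column by stripping ['text_only'] from ['text_only_away']: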

def remove_data(row):
    text_column = row['text_only_away']
    phrase = row['text_only']
    if text_column.endswith(phrase):
        return text_column[:-len(phrase)].rstrip()
    else:
        return text_column

# Apply the function to each row and create a new column
df_games_2023['new_text_column'] = df_games_2023.apply(remove_data, axis=1)

Any help on how to create a new column containing the team other than the one listed in ['text_only'] would be much appreciated. Thanks in advance!

I expect df_games_2023 to end up with a new column, along the lines of pd.DataFrame({'new_text_column': ['Mississippi Valley St.', 'Brescia', 'Pacific', etc..]})

pandas dataframe substring
1 Answer

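You didn't say what the actual symptom of the failure was, but the code that strips punctuation from each string throws an exception, because "(" on its own is not a legal regular expression.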
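The failure can be reproduced with the standard `re` module alone. The snippet below is a minimal sketch, not the asker's code: it shows that a lone "(" does not compile, that "." used as a pattern matches every character, and that `re.escape` makes such characters literal. As far as I know, pandas 2.0 and later treat even a single-character pattern as a regex when `regex=True`, while older versions may treat it literally, so the exact symptom can depend on the installed version.

```python
import re

# A lone "(" opens a group that is never closed, so it cannot compile.
try:
    re.compile("(")
    compiled = True
except re.error as exc:
    compiled = False
    print(f"invalid pattern: {exc}")

# As a regex, "." matches any character, so it removes the whole string.
wiped = re.sub(".", "", "Valley St.")

# re.escape() turns metacharacters into literals.
escaped = re.escape("(")
cleaned = re.sub(escaped, "", "Team (A)")
print(repr(wiped), cleaned)
```
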

In any case, here is a more compact regex syntax that does all the replacements in one pass:

df_games_2023['text_only'] = df_games_2023['text_only'].str.replace(r'[-,\.()%]', '', regex=True)
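To see what the character class does, here is a small sketch on a made-up Series (the values are hypothetical and not taken from the real CSV). Inside `[...]` the parentheses and `%` are literal, and the leading `-` is literal because it cannot form a range at the start of the class.

```python
import pandas as pd

# Hypothetical sample values, not rows from the real CSV.
sample = pd.Series(["Baylor (12-3)", "St. John's, 45.5%", "Pacific"])

# One pass removes every character listed in the class.
cleaned = sample.str.replace(r"[-,\.()%]", "", regex=True)
print(cleaned.tolist())
```

Note that `Series.str.replace` returns a new Series, so the result has to be assigned back to the column to take effect.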

Note that we have to override the default, regex=False.
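As for the column the question actually asks for, the following is only a sketch under assumptions, since the symptom was not stated. It assumes the combined column holds the away team immediately followed by the home team. The rows, the column names `combined`, `home`, and `away`, and the helper `strip_home` are all invented for illustration. It also shows one possible reason the suffix match could fail: if punctuation is stripped from the home column only, `endswith` no longer matches names such as "Beta St.".

```python
import pandas as pd

# Made-up rows mimicking the shape described in the question.
df = pd.DataFrame({
    "combined": ["Mississippi Valley St.Alpha", "BresciaBeta St.", "PacificGamma"],
    "home": ["Alpha", "Beta St.", "Gamma"],
})

def strip_home(row):
    """Return the combined text with the home-team suffix removed."""
    combined, home = row["combined"], row["home"]
    # Guard against an empty suffix: combined[:-0] would return "".
    if home and combined.endswith(home):
        return combined[: -len(home)].rstrip()
    return combined

df["away"] = df.apply(strip_home, axis=1)
print(df["away"].tolist())

# If punctuation is removed from "home" only, the suffix test can fail.
home_clean = df["home"].str.replace(r"[-,\.()%]", "", regex=True)
mismatch = [c.endswith(h) for c, h in zip(df["combined"], home_clean)]
print(mismatch)
```

If that is the cause, comparing against the uncleaned home name, or cleaning both columns the same way before comparing, should keep the suffix test consistent.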
