Combining DataFrames in pandas


I currently have code that scrapes everything I need for two sets of information and loads them into two separate DataFrames. The code is:


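import pandas as pd
from urllib.parse import urljoin

# beautifulSoupText, date and url are assumed to be defined earlier (not shown)
daily_racecard_info = beautifulSoupText.find(class_='w-racecard-grid-container widget-content widget-content-no-padding')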

# Finds main body for INDIVIDUAL racecards
individual_course_racecard = daily_racecard_info.find_all(class_='w-racecard-grid-meeting')

# Get each race info
race_results = daily_racecard_info.find_all("li", class_="w-racecard-grid-race-result")

# Create an empty list to store the data
data = []

# Loop over each race course
for item in individual_course_racecard:

    course_names = item.find(class_='w-racecard-grid-course clickable').text.strip()
    course_going = item.find(class_='w-racecard-grid-info w-racecard-grid-info-going').text.strip()

    # Removes word "going" from going
    goings_simple = course_going.split('Going')[-1].strip()

    data.append({
        "Date": date,
        "Course": course_names,
        "Going": goings_simple,
    })

df = pd.DataFrame(data)

# Create an empty list to store the data
data2 = []

# Loop over each race result
for race in race_results:

    # Find the a tag within the li
    a_tag = race.find("a")

    # Extract the time and race type from the first span
    time, race_type = a_tag.contents[1].text.split()

    # Extract the distance from the second span
    distance = a_tag.contents[3].text

    # Removes "(Inner)" and anything after it from distance
    distance_simple = distance.split('(Inner)')[0].strip()

    # Extract the title from the third span
    title = a_tag.contents[5].text

    # Construct the full URL using urljoin
    full_url = urljoin(url, a_tag["href"])

    # Add the data to the list
    data2.append({
        "Time": time,
        "Race Type": race_type,
        "Distance": distance_simple,
        "Title": title,
        "URL": full_url
    })

# Convert the list of dicts to a pandas DataFrame
df2 = pd.DataFrame(data2)

# Joins the dataframes side by side; rows are matched by index position, not by course
data_frames2 = pd.concat([df, df2], axis='columns')

print(data_frames2)

The (shortened) output is:

Current Output

What I need is output more like this (with the correct course attached to the correct time, and so on):

Wanted Output

I think it needs a statement inside the second for loop ("for race in race_results") that pulls the "parent" course name for each race, but I can't work out how to do it.
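One way to sketch the "parent" lookup described above is to call BeautifulSoup's find_parent on each race element and read the course and going from the enclosing meeting, so that every race row already carries its course and no concat is needed. The sample HTML below is hypothetical: it assumes each li.w-racecard-grid-race-result is nested inside its w-racecard-grid-meeting container, which the real page may or may not do, and it is trimmed to a single span per race.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: assumes each race <li> sits inside its meeting's container.
SAMPLE_HTML = """
<div class="w-racecard-grid-container">
  <div class="w-racecard-grid-meeting">
    <div class="w-racecard-grid-course clickable">Ascot</div>
    <div class="w-racecard-grid-info w-racecard-grid-info-going">Going Good</div>
    <ul>
      <li class="w-racecard-grid-race-result"><a href="/r/1"><span>13:30 Flat</span></a></li>
      <li class="w-racecard-grid-race-result"><a href="/r/2"><span>14:05 Flat</span></a></li>
    </ul>
  </div>
  <div class="w-racecard-grid-meeting">
    <div class="w-racecard-grid-course clickable">Kelso</div>
    <div class="w-racecard-grid-info w-racecard-grid-info-going">Going Soft</div>
    <ul>
      <li class="w-racecard-grid-race-result"><a href="/r/3"><span>13:45 Hurdle</span></a></li>
    </ul>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for race in soup.find_all("li", class_="w-racecard-grid-race-result"):
    # Walk up the tree to the meeting that contains this race
    meeting = race.find_parent(class_="w-racecard-grid-meeting")

    course = meeting.find(class_="w-racecard-grid-course").text.strip()
    going = meeting.find(class_="w-racecard-grid-info-going").text.strip()
    going = going.split("Going")[-1].strip()

    # Simplified: the sample has only one span holding "time race_type"
    time, race_type = race.find("a").get_text(strip=True).split()

    rows.append({
        "Course": course,
        "Going": going,
        "Time": time,
        "Race Type": race_type,
    })

# One row per race, each already tagged with its own course and going
races = pd.DataFrame(rows)
print(races)
```

If the race items turn out not to be descendants of the meeting element, find_parent returns None, and the lookup would have to be done another way, for example by looping over the meetings and calling find_all for the races inside each one.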

python pandas dataframe beautifulsoup html-lists