I can't seem to generate a DataFrame from a web page table

Problem description

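I'm not sure where the problem is, but the code does not return a DataFrame built from the data retrieved from the web page. I also tried running the pieces of code separately, and still no DataFrame was produced.

This is my first extraction project, and I can't identify the problem.

Here is the code: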

import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime 

url = 'https://en.wikipedia.org/wiki/List_of_largest_banks#By_market_capitalization'
db_name = 'Banks.db'
table_name = 'Largest_banks'
csv_path = '/home/project/Largest_banks_data.csv'
log_file = '/home/project/code_log.txt'  
table_attribs = {'Bank name': 'Name', 'Market Cap (US$ Billion)': 'MC_USD_Billion'}

###  Task 2 - Extract process

def extract(url, table_attribs):
    # Loading the webpage for scraping
    html_page = requests.get(url).text

    # Parse the HTML content of the webpage
    data = BeautifulSoup(html_page, 'html.parser')

    # Find the table with specified attributes
    # Find the main table containing the relevant data
    main_table = data.find('table', class_='wikitable sortable')

    # Find the desired `tbody` elements within the main table
    table_bodies = main_table.find_all('tbody', attrs=table_attribs)

    # Extract data from each `tbody` element
    extracted_data = []
    for table_body in table_bodies:
        rows = table_body.find_all('tr')
        for row in rows:
            extracted_data.append([cell.text for cell in row.find_all('td')])

    # Use pandas to create a DataFrame from the extracted data
    df = pd.DataFrame(extracted_data, columns=list(table_attribs.values()))

    return df

# Calling the extract function
df = extract(url, table_attribs)

if df is not None:
    # Print the result DataFrame
    print(df)
else:
    print("Extraction failed.")
1 Answer

You can read the page's tables directly into pandas:

tables = pd.read_html(html_page)

This loads 3 DataFrames, one for each of the 3 tables on the page. You can then print (or otherwise work with) each table separately; for example

tables[0] 

displays the first table ("By market capitalization") in an interactive session; in a script, use print(tables[0]).
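
For context, the `find_all('tbody', attrs=table_attribs)` call in the question asks Beautiful Soup for `tbody` tags that carry HTML attributes named `Bank name` and `Market Cap (US$ Billion)`. No tag has such attributes, so the loop never runs and the resulting DataFrame has no rows. Below is a minimal sketch of how `extract` could be rebuilt around `pd.read_html` instead. It makes several assumptions: the inline `html_page` string is a made-up stand-in for the downloaded page so the example runs offline, the bank names and figures in it are invented, the `table_index` parameter is hypothetical, and the column headers are copied from the question's `table_attribs` (the headers on the live page may be spelled differently). It also assumes an HTML parser that pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the downloaded page so the sketch runs offline.
# With the real page this would be: html_page = requests.get(url).text
html_page = """
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Bank name</th><th>Market Cap (US$ Billion)</th></tr>
  <tr><td>1</td><td>Bank A</td><td>400.5</td></tr>
  <tr><td>2</td><td>Bank B</td><td>230.25</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Bank name</th><th>Total assets (US$ Billion)</th></tr>
  <tr><td>1</td><td>Bank C</td><td>5000</td></tr>
</table>
"""

# Same mapping as in the question: page header -> desired column name.
table_attribs = {
    "Bank name": "Name",
    "Market Cap (US$ Billion)": "MC_USD_Billion",
}


def extract(html, table_attribs, table_index=0):
    """Parse every <table> in `html` and return the one at `table_index`,
    keeping only the columns named in `table_attribs` and renaming them."""
    # Wrapping the string in StringIO avoids the deprecation warning that
    # recent pandas versions raise when given literal HTML.
    tables = pd.read_html(StringIO(html))
    df = tables[table_index]
    # Select the wanted columns by their page headers, then rename them.
    return df[list(table_attribs)].rename(columns=table_attribs)


df = extract(html_page, table_attribs)
print(df)
```

If the headers in `table_attribs` do not match the page exactly, the column selection raises a `KeyError`, which is easier to diagnose than a silently empty DataFrame; printing `tables[0].columns` first shows the exact header text to use.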
